A Machine Learning Tool Uncloaks the Hidden Sources of Cancer Cells

Machine learning could soon help physicians determine the source of cancers with no known origin.

Most tumors continuously shed cells; many die, but some survive, circulating throughout the body, settling elsewhere, and forming new, metastatic tumors. To effectively treat patients with metastatic cancer, physicians must pinpoint where these cells originally came from, but in three to five percent of cases a standard diagnostic workup fails to do so.1 This disease state is called cancer of unknown primary (CUP), and the prognosis for those who have it is often grim. In these cases, doctors choose therapies according to the cancer’s most likely original site, but they may choose incorrectly. Alternatively, they can use broader treatments that are less effective and cause more side effects than targeted ones. As a result, these cancers often progress quickly, and patient survival is only six to 16 months.2

Intae Moon, a PhD student at the Massachusetts Institute of Technology (MIT) and the Dana-Farber Cancer Institute in Alexander Gusev’s laboratory, wanted to find an easier way to identify the primary site in CUP. Next-generation sequencing (NGS) can identify a tumor’s mutations and help determine the cancer type, but the amount of mutation data generated can be too vast for a physician to sift through during an initial diagnosis. Therefore, clinicians typically use NGS only after they identify the cancer type, to pinpoint specific mutations for targeted therapies. NGS has also not been well investigated for clinical diagnosis and prognosis of CUP. To address this, Moon and his colleagues developed a machine learning model called OncoNPC (Oncology NGS-based Primary cancer-type Classifier) to efficiently sift through large amounts of complex mutation data.3 The study, published in Nature Medicine, uses XGBoost, a gradient-boosted decision tree algorithm, to search tumor DNA for the patterns of genetic mutations most strongly linked to various types of cancer.

Using NGS data from 36,445 tumor samples with known primary cancers, the researchers ran OncoNPC to search for genetic mutations, copy number alterations, and mutational signatures. The model also included patient age and biological sex from their electronic health records. The team trained the model on data from three different cancer centers across the US to link certain genetic signatures with one of 22 different cancer types. After training, they tested OncoNPC on held-out data that they had withheld from the training set. “OncoNPC managed to correctly identify the origins of known tumors about 80 percent of the time,” said Moon in an email. “When we focused on high-confidence predictions, which made up around 65 percent of all cases, the model’s accuracy jumped to an impressive 95 percent,” though it was slightly less precise with rare cancers.
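The relationship between overall accuracy and high-confidence accuracy can be illustrated with a minimal Python sketch. It is not the study's code: the function name `summarize_predictions`, the 0.9 confidence cutoff, and the toy probabilities are all assumptions made for illustration. The idea is that a classifier outputs one probability per cancer type, the top probability serves as the confidence, and accuracy is reported both for all samples and for only those whose confidence clears the cutoff.

```python
from typing import Dict, List, Tuple


def summarize_predictions(
    probabilities: List[Dict[str, float]],
    true_labels: List[str],
    threshold: float = 0.9,  # hypothetical cutoff, chosen only for this example
) -> Tuple[float, float, float]:
    """Return (overall accuracy, high-confidence fraction, high-confidence accuracy)."""
    if not true_labels or len(probabilities) != len(true_labels):
        raise ValueError("need one probability dict per non-empty label list")

    correct = 0            # correct predictions among all samples
    confident = 0          # samples whose top probability meets the threshold
    confident_correct = 0  # correct predictions among the confident samples

    for probs, truth in zip(probabilities, true_labels):
        predicted = max(probs, key=probs.get)  # cancer type with the top probability
        is_correct = predicted == truth
        correct += is_correct
        if probs[predicted] >= threshold:
            confident += 1
            confident_correct += is_correct

    total = len(true_labels)
    high_conf_accuracy = confident_correct / confident if confident else float("nan")
    return correct / total, confident / total, high_conf_accuracy


# Made-up toy data: four held-out tumors and three cancer types.
toy_probs = [
    {"lung": 0.95, "breast": 0.03, "colorectal": 0.02},
    {"lung": 0.10, "breast": 0.85, "colorectal": 0.05},
    {"lung": 0.40, "breast": 0.35, "colorectal": 0.25},
    {"lung": 0.02, "breast": 0.02, "colorectal": 0.96},
]
toy_truth = ["lung", "breast", "breast", "colorectal"]

overall, frac, hc_acc = summarize_predictions(toy_probs, toy_truth)
print(f"overall accuracy: {overall:.2f}")
print(f"high-confidence fraction: {frac:.2f}")
print(f"high-confidence accuracy: {hc_acc:.2f}")
```

In this toy set, the one wrong prediction is also a low-confidence one, so restricting to confident calls raises accuracy while covering fewer samples. That is the same trade-off Moon describes between the roughly 80 percent and 95 percent figures.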

The researchers then applied OncoNPC to tumors from 971 patients with CUP who were treated at the Dana-Farber Cancer Institute. OncoNPC classified 41.2 percent of the CUP tumors with high confidence. The team also determined which genetic features were most salient for identifying each cancer type, information that is clinically and biologically valuable given the mysterious nature of CUP tumors.

Because the actual locations of the primary tumor sites were unknown, the model’s accuracy was determined by checking the results against each person’s NGS data to see if they had a genetic predisposition for a certain type of cancer. The researchers found that the model’s predictions closely matched the cancer type indicated most strongly by these inherited mutations.

To see if treating a patient according to OncoNPC predictions would increase their survival, the team retrospectively looked at 158 patients with CUP who received their first treatment at Dana-Farber. A certified oncologist manually reviewed patients’ charts to see whether they were treated according to the OncoNPC cancer type predictions. The researchers found that patients with CUP whose first palliative treatment—based on their physician’s educated guess—was in line with their OncoNPC prediction had significantly better survival than those who were not treated according to OncoNPC.

The researchers hope to expand the model to work with more patients and cancers. “While the model did show decent performance for patients of other ethnic backgrounds, we recognize the need for more in-depth research to confirm that it’s effective across a diverse range of patients,” said Moon. In addition, the model looks for only the 22 most common cancer types; if a CUP comes from another site, the model cannot identify it.

Still, OncoNPC could soon help doctors make difficult treatment decisions. “If a hospital uses NGS tumor sequencing, they should be able to incorporate it as an additional source of information for the oncologist,” said study senior author Alexander Gusev in an email.

“What is unique in this paper is that they used routine testing data that is used in the clinic,” said Edwin Cuppen, a cancer biologist at University Medical Center Utrecht, who was not involved in this study. “These are a separate class of tumors. We don’t understand what makes these tumors atypical, but we do know that the prognosis of these patients typically is much worse. So, a prediction for a big chunk of the patients is a big step forward.”


  1. Pavlidis N, Fizazi K. Carcinoma of unknown primary (CUP). Crit Rev Oncol/Hematol. 2009;69(3):271-8.
  2. Varadhachary GR, Raber MN. Cancer of unknown primary site. N Engl J Med. 2014;371(8):757-65.
  3. Moon I, et al. Machine learning for genetics-based classification and treatment response prediction in cancer of unknown primary. Nat Med. 2023;29:2057-67.