Addressing Power and Pitfalls in Machine Learning Neoantigen Prediction


Scientists train machine learning prediction models on massive datasets and fine-tune their predictions with transfer learning, which reuses knowledge learned on one task to boost performance on a related task.

Antigen presentation by major histocompatibility complex (MHC) proteins is the caller ID of the immune system. On the cell surface, MHCs display peptides derived from cellular components or foreign sources such as viruses, bacteria, or parasites. Peptide display allows the adaptive immune system to recognize and respond to self or non-self antigens.1 Just as someone might suspect a spam call from an unknown number, T cells that see immunogenic peptide-MHC complexes (pMHCs) can alert the immune response to suspicious antigens. Immune activation can then eliminate unwanted or malicious peptide sources. 

MHC-mediated peptide presentation is essential for preventing and fighting infection, and it helps T cells intercept abnormalities, such as cancer cells. Researchers hunt for novel cancer-specific antigens called neoantigens to develop personalized immunotherapies, but this hunt is often hindered by experimental bottlenecks such as low sensitivity and throughput.2 Computational biologists, such as Rachel Karchin from Johns Hopkins University, turn towards machine learning prediction models to overcome these experimental limitations. 

“Our group has really been trying to push the state of neoantigen prediction forward,” said Karchin. Although many machine learning tools for predicting antigen presentation exist, predicting immunogenicity remains a challenge.2 In their latest work published in Nature Machine Intelligence,3 Karchin and her team partnered with oncology and immunology experts Kellie Nicole Smith and Valsamo Anagnostou to develop a transfer learning approach that predicts which neoantigen sequences will elicit an immune response, classifying them as neoepitopes.


Their method, called BigMHC, involves seven deep neural networks that the researchers first trained and tested using mass spectrometry datasets of pMHC fragments, which are indicative of neoantigen presentation. They then transferred what their presentation prediction model learned onto data from T cell receptor (TCR) response assays to predict immunogenicity. “Because all immunogenic neoepitopes are presented but not all presented [neoantigens] are immunogenic, this is a fine-tuning task,” said first author Benji Albert, who led this work as an undergraduate researcher in Karchin’s laboratory. “We’re trying to not change the whole network, but just the last few projections. It’s the same training task actually, it’s just modifying the last few layers.”
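The fine-tuning step Albert describes can be sketched in a few lines: a pretrained network keeps its early layers fixed and only the final projections are updated on the new task. This is a minimal illustrative sketch of that general pattern, not BigMHC's actual code; the layer names and counts are hypothetical.

```python
# Sketch of transfer learning by fine-tuning only the last layers.
# Layer names and the freeze/train split are illustrative assumptions.

class Layer:
    """A stand-in for one block of a trained neural network."""
    def __init__(self, name):
        self.name = name
        self.frozen = False  # frozen layers keep their pretrained weights

def fine_tune(layers, n_trainable=2):
    """Freeze all but the last n_trainable layers; return what still trains."""
    for layer in layers[:-n_trainable]:
        layer.frozen = True  # preserve knowledge from the presentation task
    return [layer.name for layer in layers if not layer.frozen]

# A pretrained "presentation" model with five blocks
net = [Layer(f"block{i}") for i in range(1, 6)]

# Only the final projections are updated on the immunogenicity data
trainable = fine_tune(net, n_trainable=2)
print(trainable)  # ['block4', 'block5']
```

In deep learning frameworks the same idea is usually expressed by disabling gradient updates on the frozen parameters before retraining on the smaller, related dataset.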

The researchers found that their BigMHC model yields powerful presentation and immunogenicity prediction, but they also highlighted limitations to machine learning-based predictions. “This is just part of the adaptive immune response, this peptide-MHC-T cell interaction,” Karchin said. “For a T cell and a tumor cell to really recognize each other, there are many other ligand-receptor interactions that are important, and they should be incorporated into this type of prediction.”

Scientists also face the challenge of biological redundancy when developing computational models for neoantigen prediction. “Many of the tools we apply nowadays in biology to do machine learning, they come from the computational field,” said Morten Nielsen, a computational biologist from the Technical University of Denmark, who was not involved in this study but who also develops computational tools for antigen prediction, including NetMHCpan-4.1, which Karchin and her team compared with BigMHC in their study. “Working on biological data is very different compared to working on general data used in computer science,” Nielsen said.

The way that a machine learning model learns and predicts is informed by the way data points are grouped and partitioned. Unlike the typically discrete data points in computer science, living systems often rely on redundancies in proteins and other biological molecules, which complicates the challenge of data leakage in machine learning.4 When training and test datasets contain overlapping data or data points that share significant similarity, researchers may overestimate a model’s performance.

In the case of neoantigens, each mass spectrometry-derived sequence that researchers use to train and test machine learning models represents a piece of an antigen and its presenting MHC molecule, called a human leukocyte antigen (HLA) in humans. Every person has a genetically unique set of HLAs, and each HLA recognizes and presents peptides for antigen recognition.2

“The same peptide can be seen in many contexts,” Nielsen said. “If you take them as being two data points and put one of them into the test dataset and the other into the training, then the method can learn by heart. Even though they are in principle different data points because they come from different HLAs, they are still the same data point because they obey the same rules, and the rules you learn from one can be applied to the other.”


Karchin’s team considered each unique pMHC pair as a separate data point. They were careful to deduplicate pMHC pairs to ensure that there was no overlap in their model’s training and test data points but noted that they did not stratify their data by peptides separately. “I think the model might have potential, but they need to demonstrate that on a proper dataset where they have dealt with this redundancy issue,” Nielsen said.

Karchin and Albert responded to this concern. “Although there is no pMHC overlap, we took a look into the extent of peptide overlap and found that of 937 instances in the neoepitope test (benchmark) dataset, BigMHC had seen 28 negative and 2 positive peptides in its training,” the researchers said via email. “The standard in the field is to consider a unique instance as the whole peptide-MHC complex, but even disregarding MHCs, the peptide overlap in BigMHC neoepitope training and test sets is negligible.”

Nielsen also highlighted that this is a challenge seen often with machine learning prediction models in biology. “More than half of the papers published in this field of deep immuno-informatics, they suffer from this issue,” he said. “People are not aware of the data challenges in biology if you come from computer science where data is just data.” 

Despite these growing pains in the rapidly expanding biological application of machine learning, neoantigen prediction tools attack the important problem of large-scale discovery that cannot currently be achieved experimentally. BigMHC is the first published transfer learning tool for predicting neoantigen immunogenicity, opening the door to future machine learning methods that may improve personalized immunotherapy development. 


  1. Wieczorek M, et al. Major histocompatibility complex (MHC) class I and MHC class II proteins: Conformational plasticity in antigen presentation. Front Immunol. 2017;8:292. 
  2. Gfeller D, et al. Contemplating immunopeptidomes to better predict them. Semin Immunol. 2023;66:101708.
  3. Albert BA, et al. Deep neural networks predict class I major histocompatibility complex epitope presentation and transfer learn neoepitope immunogenicity. Nat Mach Intell. 2023;5(8):861-72.
  4. Fang J. The role of data imbalance bias in the prediction of protein stability change upon mutation. PLoS One. 2023;18(3):e0283727.

This article was updated on December 15th, 2023 to clarify Nielsen’s work on NetMHCpan-4.1.
