Enhancing TCR-Peptide Interaction Prediction with Pretrained Language Models and Molecular Representations

📅 2025-04-22
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address data scarcity and poor generalization to novel epitopes in TCR–pMHC binding prediction, this paper proposes a cross-modal deep learning framework. Methodologically, it is the first to jointly leverage ESM-1b for encoding TCR-β chain sequences and MolFormer for encoding peptide SMILES strings, enabling synergistic representation learning across biological sequences and small-molecule chemistry. A robust negative sampling strategy is further introduced to mitigate label bias and enhance zero-shot and few-shot robustness. Evaluated on multiple benchmarks, the framework significantly outperforms state-of-the-art baselines—including ChemBERTa, TITAN, and NetTCR—achieving an 8.2% absolute improvement in zero-shot AUC and a 15.6% gain in embedding-space clustering purity. This work establishes a scalable and interpretable paradigm for low-resource immune recognition modeling.
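The summary does not spell out the negative sampling scheme. A common baseline in TCR–pMHC prediction, which the paper's "robust" strategy may refine, is to generate negatives by re-pairing observed TCRs with peptides they were not seen binding. A minimal sketch, assuming that baseline (the function name `make_shuffled_negatives` and the example CDR3/epitope strings are illustrative, not from the paper):

```python
import random

def make_shuffled_negatives(positive_pairs, n_negatives, seed=0):
    """Build negative TCR-peptide pairs by mismatching observed pairs.

    Known binders are excluded so no positive is relabeled as negative.
    This is a common baseline; the paper's actual strategy may differ.
    """
    rng = random.Random(seed)
    positives = set(positive_pairs)
    tcrs = [t for t, _ in positive_pairs]
    peptides = [p for _, p in positive_pairs]
    negatives = set()
    while len(negatives) < n_negatives:
        pair = (rng.choice(tcrs), rng.choice(peptides))
        if pair not in positives:  # never relabel a known binder
            negatives.add(pair)
    return sorted(negatives)

# Illustrative CDR3-beta / epitope pairs (placeholder sequences).
pairs = [("CASSLGQAYEQYF", "GILGFVFTL"),
         ("CASSIRSSYEQYF", "NLVPMVATV"),
         ("CASSPGTGGYGYTF", "ELAGIGILTV")]
negs = make_shuffled_negatives(pairs, n_negatives=3)
```

Because negatives are drawn only from the observed TCR and peptide pools, the class balance can be controlled without introducing out-of-distribution sequences, one way such a scheme mitigates label bias.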

📝 Abstract
Understanding the binding specificity between T-cell receptors (TCRs) and peptide-major histocompatibility complexes (pMHCs) is central to immunotherapy and vaccine development. However, current predictive models struggle with generalization, especially in data-scarce settings and when faced with novel epitopes. We present LANTERN (Large lAnguage model-powered TCR-Enhanced Recognition Network), a deep learning framework that combines large-scale protein language models with chemical representations of peptides. By encoding TCR β-chain sequences using ESM-1b and transforming peptide sequences into SMILES strings processed by MolFormer, LANTERN captures rich biological and chemical features critical for TCR-peptide recognition. Through extensive benchmarking against existing models such as ChemBERTa, TITAN, and NetTCR, LANTERN demonstrates superior performance, particularly in zero-shot and few-shot learning scenarios. Our model also benefits from a robust negative sampling strategy and shows significant clustering improvements via embedding analysis. These results highlight the potential of LANTERN to advance TCR-pMHC binding prediction and support the development of personalized immunotherapies.
Problem

Research questions and friction points this paper is trying to address.

Predicts TCR-peptide binding specificity for immunotherapy and vaccine development
Addresses poor generalization in data-scarce and novel-epitope settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines protein language models with chemical peptide representations
Encodes TCR sequences using ESM-1b and peptides via SMILES with MolFormer
Demonstrates superior performance in zero-shot and few-shot learning scenarios
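The peptide-to-SMILES step can be illustrated with a minimal sketch: each L-amino acid contributes an `N[C@@H](<side chain>)C(=O)` backbone unit, units are concatenated to form the peptide bonds, and the C-terminus is capped with a hydroxyl. The residue table below covers only a handful of amino acids and is purely illustrative; in practice a toolkit such as RDKit (`Chem.MolFromSequence` plus `Chem.MolToSmiles`) handles the full alphabet and stereochemistry robustly.

```python
# Minimal peptide -> SMILES sketch (linear chain of L-amino acids).
# Backbone unit per residue: N[C@@H](<side chain>)C(=O); glycine is achiral.
RESIDUE_SMILES = {
    "G": "NCC(=O)",                 # glycine
    "A": "N[C@@H](C)C(=O)",         # alanine
    "V": "N[C@@H](C(C)C)C(=O)",     # valine
    "L": "N[C@@H](CC(C)C)C(=O)",    # leucine
    "F": "N[C@@H](Cc1ccccc1)C(=O)", # phenylalanine
}

def peptide_to_smiles(sequence):
    """Chain backbone units and cap the C-terminus with OH."""
    try:
        body = "".join(RESIDUE_SMILES[aa] for aa in sequence)
    except KeyError as err:
        raise ValueError(f"residue not in sketch table: {err}")
    return body + "O"

print(peptide_to_smiles("GA"))  # NCC(=O)N[C@@H](C)C(=O)O
```

The resulting string for the dipeptide GA encodes H2N-CH2-C(=O)-NH-CH(CH3)-COOH, which is what lets a chemistry-pretrained model like MolFormer see the peptide as a small molecule rather than a residue sequence.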
Cong Qi, New Jersey Institute of Technology
Hanzhang Fang, New Jersey Institute of Technology
Siqi Jiang, New Jersey Institute of Technology
Tianxing Hu, New Jersey Institute of Technology
Zhi Wei, New Jersey Institute of Technology (Dist. Prof. of CS and Statistics, NJIT; Adj. Prof. at UPenn; Fellow of IEEE, AAAS)
Statistical modeling · Machine Learning · Bioinformatics · Genomics · Statistical Genetics