Enhancing TCR-Peptide Interaction Prediction with Pretrained Language Models and Molecular Representations

📅 2025-04-22
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address data scarcity and poor generalization to novel epitopes in TCR–pMHC binding prediction, this paper proposes a cross-modal deep learning framework. Methodologically, it is the first to jointly leverage ESM-1b for encoding TCR-β chain sequences and MolFormer for encoding peptide SMILES strings, enabling synergistic representation learning across biological sequences and small-molecule chemistry. A robust negative sampling strategy is further introduced to mitigate label bias and enhance zero-shot and few-shot robustness. Evaluated on multiple benchmarks, the framework significantly outperforms state-of-the-art baselines—including ChemBERTa, TITAN, and NetTCR—achieving an 8.2% absolute improvement in zero-shot AUC and a 15.6% gain in embedding-space clustering purity. This work establishes a scalable and interpretable paradigm for low-resource immune recognition modeling.
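The summary does not spell out the negative sampling scheme. A common baseline in TCR–pMHC prediction, which the paper's "robust" strategy may refine, is to generate negatives by re-pairing observed TCRs with peptides they were not seen binding. A minimal sketch, assuming that baseline (the function name `make_shuffled_negatives` and the example CDR3/epitope strings are illustrative, not from the paper):

```python
import random

def make_shuffled_negatives(positive_pairs, n_negatives, seed=0):
    """Build negative TCR-peptide pairs by mismatching observed pairs.

    Known binders are excluded so no positive is relabeled as negative.
    This is a common baseline; the paper's actual strategy may differ.
    """
    rng = random.Random(seed)
    positives = set(positive_pairs)
    tcrs = [t for t, _ in positive_pairs]
    peptides = [p for _, p in positive_pairs]
    negatives = set()
    while len(negatives) < n_negatives:
        pair = (rng.choice(tcrs), rng.choice(peptides))
        if pair not in positives:  # never relabel a known binder
            negatives.add(pair)
    return sorted(negatives)

# Illustrative CDR3-beta / epitope pairs (placeholder sequences).
pairs = [("CASSLGQAYEQYF", "GILGFVFTL"),
         ("CASSIRSSYEQYF", "NLVPMVATV"),
         ("CASSPGTGGYGYTF", "ELAGIGILTV")]
negs = make_shuffled_negatives(pairs, n_negatives=3)
```

Because negatives are drawn only from the observed TCR and peptide pools, the class balance can be controlled without introducing out-of-distribution sequences, one way such a scheme mitigates label bias.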

📝 Abstract
Understanding the binding specificity between T-cell receptors (TCRs) and peptide-major histocompatibility complexes (pMHCs) is central to immunotherapy and vaccine development. However, current predictive models struggle with generalization, especially in data-scarce settings and when faced with novel epitopes. We present LANTERN (Large lAnguage model-powered TCR-Enhanced Recognition Network), a deep learning framework that combines large-scale protein language models with chemical representations of peptides. By encoding TCR β-chain sequences using ESM-1b and transforming peptide sequences into SMILES strings processed by MolFormer, LANTERN captures rich biological and chemical features critical for TCR-peptide recognition. Through extensive benchmarking against existing models such as ChemBERTa, TITAN, and NetTCR, LANTERN demonstrates superior performance, particularly in zero-shot and few-shot learning scenarios. Our model also benefits from a robust negative sampling strategy and shows significant clustering improvements via embedding analysis. These results highlight the potential of LANTERN to advance TCR-pMHC binding prediction and support the development of personalized immunotherapies.
Problem

Research questions and friction points this paper is trying to address.

Predicts TCR-peptide binding specificity for immunotherapy and vaccine development
Addresses poor generalization in data-scarce and novel-epitope settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines protein language models with chemical peptide representations
Encodes TCR sequences using ESM-1b and peptides via SMILES with MolFormer
Demonstrates superior performance in zero-shot and few-shot learning scenarios
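The peptide-to-SMILES step can be illustrated with a minimal sketch: each L-amino acid contributes an `N[C@@H](<side chain>)C(=O)` backbone unit, units are concatenated to form the peptide bonds, and the C-terminus is capped with a hydroxyl. The residue table below covers only a handful of amino acids and is purely illustrative; in practice a toolkit such as RDKit (`Chem.MolFromSequence` plus `Chem.MolToSmiles`) handles the full alphabet and stereochemistry robustly.

```python
# Minimal peptide -> SMILES sketch (linear chain of L-amino acids).
# Backbone unit per residue: N[C@@H](<side chain>)C(=O); glycine is achiral.
RESIDUE_SMILES = {
    "G": "NCC(=O)",                 # glycine
    "A": "N[C@@H](C)C(=O)",         # alanine
    "V": "N[C@@H](C(C)C)C(=O)",     # valine
    "L": "N[C@@H](CC(C)C)C(=O)",    # leucine
    "F": "N[C@@H](Cc1ccccc1)C(=O)", # phenylalanine
}

def peptide_to_smiles(sequence):
    """Chain backbone units and cap the C-terminus with OH."""
    try:
        body = "".join(RESIDUE_SMILES[aa] for aa in sequence)
    except KeyError as err:
        raise ValueError(f"residue not in sketch table: {err}")
    return body + "O"

print(peptide_to_smiles("GA"))  # NCC(=O)N[C@@H](C)C(=O)O
```

The resulting string for the dipeptide GA encodes H2N-CH2-C(=O)-NH-CH(CH3)-COOH, which is what lets a chemistry-pretrained model like MolFormer see the peptide as a small molecule rather than a residue sequence.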
Cong Qi, New Jersey Institute of Technology
Hanzhang Fang, New Jersey Institute of Technology
Siqi Jiang, New Jersey Institute of Technology
Tianxing Hu, New Jersey Institute of Technology
Zhi Wei, New Jersey Institute of Technology (Dist. Prof. of CS and Statistics, NJIT; Adj. Prof. at UPenn; Fellow of IEEE, AAAS)
Statistical modeling · Machine Learning · Bioinformatics · Genomics · Statistical Genetics