🤖 AI Summary
Predicting antigen-binding specificity of T-cell receptors (TCRs) remains challenging due to their extreme sequence diversity, particularly within the hypervariable CDR3 region.
Method: We introduce tcrLM, a lightweight, TCR-specific language model trained via (1) a novel masked language modeling framework tailored to CDR3 sequences; (2) virtual adversarial training (VAT) to enhance generalization and robustness; and (3) large-scale pretraining on over 100 million real-world TCR sequences, enabling the first systematic characterization of amino acid biochemical property–position preferences in the CDR3 region.
Contribution/Results: After fine-tuning, tcrLM consistently outperforms existing TCR–antigen binding predictors and general-purpose protein language models across multiple independent, external, and COVID-19-specific benchmarks. Moreover, its prediction scores significantly correlate with immunotherapy response and clinical outcomes in melanoma patients, establishing a new paradigm for deciphering TCR recognition mechanisms and advancing personalized immunotherapy.
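The virtual adversarial training in point (2) can be illustrated with a minimal NumPy sketch of the standard VAT perturbation step (Miyato et al.): a random direction is refined by one power-iteration step toward the direction that most increases the KL divergence of the model's output, then scaled to a fixed radius. The linear-softmax model, dimensions, and hyperparameters below are illustrative placeholders, not tcrLM's actual implementation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # numerically stable softmax
    return e / e.sum()

def kl(p, q):
    # KL divergence between two discrete distributions
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))

def vat_perturbation(x, W, eps=1.0, xi=1e-2, h=1e-4, seed=0):
    """One power-iteration step of the VAT perturbation (Miyato et al.):
    estimate the input direction that most increases KL(p(x) || p(x + r))
    and scale it to radius eps. Gradients are taken numerically here for
    self-containment; a real implementation would use backprop."""
    p = softmax(W @ x)  # clean prediction, treated as a constant target
    rng = np.random.default_rng(seed)
    d = rng.normal(size=x.shape)
    d /= np.linalg.norm(d)
    # Numerical gradient of r -> KL(p || softmax(W(x + r))) at r = xi*d;
    # near r = 0 this gradient is approximately xi * H d (H = Hessian of
    # the KL at r = 0), i.e. one power-iteration step toward H's top
    # eigenvector, the most KL-sensitive direction.
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (kl(p, softmax(W @ (x + xi * d + e)))
                - kl(p, softmax(W @ (x + xi * d - e)))) / (2 * h)
    return eps * g / (np.linalg.norm(g) + 1e-12)
```

During training, the VAT loss adds KL(p(x) || p(x + r_adv)) as a regularizer, which smooths the model's predictions in the most sensitive input direction and needs no labels.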
📝 Abstract
The anti-cancer immune response relies on binding between T-cell receptors (TCRs) and antigens, which elicits the adaptive immunity that eliminates tumor cells. The immune system's ability to respond to diverse novel neoantigens arises from the immense diversity of the TCR repertoire. However, this diversity poses a significant challenge for accurately predicting antigen–TCR binding. In this study, we introduce a lightweight masked language model, termed tcrLM, to address this challenge. Our approach randomly masks segments of TCR sequences and trains tcrLM to infer the masked segments, thereby enabling the extraction of expressive features from TCR sequences. To further enhance robustness, we incorporate virtual adversarial training into tcrLM. We construct the largest TCR CDR3 sequence set to date, comprising more than 100 million distinct sequences, and pretrain tcrLM on these sequences. The pretrained encoder is subsequently applied to predict TCR–antigen binding specificity. We evaluate model performance on three test sets: an independent test set, an external test set, and a COVID-19 test set. The results demonstrate that tcrLM not only surpasses existing TCR–antigen binding prediction methods but also outperforms other mainstream protein language models. More interestingly, tcrLM effectively captures the biochemical properties and positional preferences of amino acids within TCR sequences. Additionally, the predicted TCR–neoantigen binding scores indicate immunotherapy responses and clinical outcomes in a melanoma cohort. These findings demonstrate the potential of tcrLM for predicting TCR–antigen binding specificity, with significant implications for advancing immunotherapy and personalized medicine.
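The segment-masking objective described above can be sketched as follows: hide a contiguous stretch of a CDR3 sequence and keep the original residues as prediction targets, so the model must recover them from the surrounding context. The mask token, segment length, and masking fraction below are illustrative placeholders rather than tcrLM's actual settings.

```python
import random

def mask_cdr3_segment(seq, mask_frac=0.15, mask_token="#", seed=None):
    """Mask a random contiguous segment (~mask_frac of the residues) of a
    CDR3 amino-acid sequence. Returns the masked token list and a dict
    mapping masked positions to their original residues (the targets the
    language model is trained to predict)."""
    rng = random.Random(seed)
    tokens = list(seq)
    span = max(1, round(len(tokens) * mask_frac))
    start = rng.randrange(len(tokens) - span + 1)
    targets = {i: tokens[i] for i in range(start, start + span)}
    for i in targets:
        tokens[i] = mask_token
    return tokens, targets

# Example with a typical CDR3 beta-chain sequence (illustrative):
tokens, targets = mask_cdr3_segment("CASSLAPGATNEKLFF", seed=0)
```

The pretraining loss is then the cross-entropy between the model's predictions at the masked positions and the residues stored in `targets`; everything the encoder learns about residue co-occurrence and position comes from solving this reconstruction task at scale.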