Driving Accurate Allergen Prediction with Protein Language Models and Generalization-Focused Evaluation

📅 2025-08-14
🤖 AI Summary
This study addresses three practical challenges in allergen protein identification: (1) zero-shot recognition of novel allergens absent from the training data; (2) fine-grained discrimination between allergenic and non-allergenic proteins with high sequence similarity; and (3) assessing the impact of single-point mutations on allergenicity. To this end, the authors propose Applm, a computational framework that integrates the 100-billion-parameter xTrimoPGLM protein language model into allergen prediction via transfer learning and sequence feature fusion, leveraging the generalizable representations it learned from pretraining on one trillion tokens. They further construct a curated benchmark tailored to real-world scenarios, covering zero-shot prediction, fine-grained classification, and mutation impact analysis. Experiments show that Applm consistently outperforms seven state-of-the-art methods across these tasks. The code and benchmark datasets are publicly released.

📝 Abstract
Allergens, typically proteins capable of triggering adverse immune responses, represent a significant public health challenge. To accurately identify allergen proteins, we introduce Applm (Allergen Prediction with Protein Language Models), a computational framework that leverages the 100-billion-parameter xTrimoPGLM protein language model. We show that Applm consistently outperforms seven state-of-the-art methods in a diverse set of tasks that closely resemble difficult real-world scenarios. These include identifying novel allergens that lack similar examples in the training set, differentiating between allergens and non-allergens among homologs with high sequence similarity, and assessing functional consequences of mutations that create few changes to the protein sequences. Our analysis confirms that xTrimoPGLM, originally trained on one trillion tokens to capture general protein sequence characteristics, is crucial for Applm's performance by detecting important differences among protein sequences. In addition to providing Applm as open-source software, we also provide our carefully curated benchmark datasets to facilitate future research.
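
The "novel allergen" scenario in the abstract depends on test proteins having no close relatives in the training set. The Python sketch below illustrates one way such a hold-out could be built. It is not the authors' procedure: the k-mer Jaccard similarity, the `max_sim` threshold, and the function names are all assumptions chosen for illustration, and a real benchmark would more likely use alignment-based sequence identity.

```python
from typing import List, Set, Tuple


def kmer_set(seq: str, k: int = 3) -> Set[str]:
    """Return the set of overlapping k-mers in a protein sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two k-mer sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def novelty_split(
    train: List[str],
    candidates: List[str],
    max_sim: float = 0.3,  # hypothetical threshold, not taken from the paper
    k: int = 3,
) -> Tuple[List[str], List[str]]:
    """Split candidates into (novel, familiar) relative to the training set.

    A candidate counts as "novel" when its highest k-mer Jaccard similarity
    to any training sequence is at most max_sim.
    """
    train_kmers = [kmer_set(s, k) for s in train]
    novel: List[str] = []
    familiar: List[str] = []
    for seq in candidates:
        cand = kmer_set(seq, k)
        best = max((jaccard(cand, t) for t in train_kmers), default=0.0)
        (novel if best <= max_sim else familiar).append(seq)
    return novel, familiar
```

Calling `novelty_split(train_seqs, test_seqs)` returns the candidates that would be suitable for a zero-shot style test set first, followed by those too similar to the training data.
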
Problem

Research questions and friction points this paper is trying to address.

Accurate prediction of allergen proteins using language models
Identifying novel allergens without similar training examples
Differentiating allergens from non-allergens in similar sequences
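
The third point above concerns allergen and non-allergen proteins that look nearly identical. As a rough illustration of how such hard pairs might be collected for evaluation, the sketch below ranks cross-class pairs by similarity. The similarity measure here is Python's `difflib.SequenceMatcher` ratio, which is only a crude stand-in for alignment-based identity; the `min_sim` cutoff and the function names are hypothetical and do not come from the paper.

```python
from difflib import SequenceMatcher
from itertools import product
from typing import List, Tuple


def similarity(a: str, b: str) -> float:
    """Crude similarity in [0, 1]; not an alignment-based sequence identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def hard_pairs(
    allergens: List[str],
    non_allergens: List[str],
    min_sim: float = 0.8,  # hypothetical cutoff for "high similarity"
) -> List[Tuple[str, str, float]]:
    """Return (allergen, non_allergen, similarity) triples at or above min_sim,
    sorted from most to least similar."""
    scored = [
        (a, n, similarity(a, n)) for a, n in product(allergens, non_allergens)
    ]
    kept = [p for p in scored if p[2] >= min_sim]
    return sorted(kept, key=lambda p: p[2], reverse=True)
```

Pairs returned by `hard_pairs` are the cases where a predictor cannot rely on overall sequence similarity and has to pick up on the few residues that differ.
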
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages xTrimoPGLM protein language model
Outperforms seven state-of-the-art methods
Detects important protein sequence differences
Brian Shing-Hei Wong
School of Biomedical Sciences, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, China
Joshua Mincheol Kim
School of Biomedical Sciences, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, China
Sin-Hang Fung
School of Biomedical Sciences, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, China; Shenzhen Research Institute, The Chinese University of Hong Kong, Shenzhen, China
Qing Xiong
Department of Health Technology and Informatics, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong SAR, China
Kelvin Fu-Kiu Ao
School of Biomedical Sciences, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, China
Junkang Wei
University of Michigan
Bioinformatics, Proteomics, Protein-molecule interaction, Machine Learning
Ran Wang
School of Biomedical Sciences, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, China
Dan Michelle Wang
School of Biomedical Sciences, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, China
Jingying Zhou
School of Biomedical Sciences, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, China
Bo Feng
Professor of Communication, University of California, Davis
Technologically-mediated Communication, Supportive Communication, Intercultural Communication, Physician-patient Interaction
Alfred Sze-Lok Cheng
School of Biomedical Sciences, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, China
Kevin Y. Yip
Center for Data Sciences, Sanford Burnham Prebys Medical Discovery Institute, La Jolla, CA, USA; Cancer Genome and Epigenetics Program, NCI-Designated Cancer Center, Sanford Burnham Prebys Medical Discovery Institute, La Jolla, CA, USA; Center for Neurologic Diseases, Sanford Burnham Prebys Medical Discovery Institute, La Jolla, CA, USA; Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, China
Stephen Kwok-Wing Tsui
The Chinese University of Hong Kong
molecular biology, genetics, genomics, bioinformatics, virology
Qin Cao
School of Biomedical Sciences, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, China; Shenzhen Research Institute, The Chinese University of Hong Kong, Shenzhen, China; Hong Kong Bioinformatics Centre, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, China