Protein as a Second Language for LLMs

📅 2025-10-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses zero-shot interpretation of unknown protein functions without task-specific adapters or supervised fine-tuning. The authors propose the "Protein-as-Second-Language" framework, which maps amino-acid sequences into symbolic representations that large language models (LLMs) can interpret, and construct a bilingual protein question-answering corpus of 79,926 sequence-question-answer triples covering attribute prediction, descriptive understanding, and extended reasoning. Using only in-context learning and prompt engineering, with no parameter updates, the method generalizes across models: on GPT-4 and multiple open-weight LLMs, ROUGE-L scores improve by an average of 7.0% (up to 17.2%), surpassing specialized protein models. This is the first demonstration that general-purpose LLMs, driven purely by prompting, can match or exceed supervised fine-tuned domain-specific models on protein functional understanding, establishing a new paradigm for biological language modeling.

📝 Abstract
Deciphering the function of unseen protein sequences is a fundamental challenge with broad scientific impact, yet most existing methods depend on task-specific adapters or large-scale supervised fine-tuning. We introduce the "Protein-as-Second-Language" framework, which reformulates amino-acid sequences as sentences in a novel symbolic language that large language models can interpret through contextual exemplars. Our approach adaptively constructs sequence-question-answer triples that reveal functional cues in a zero-shot setting, without any further training. To support this process, we curate a bilingual corpus of 79,926 protein-QA instances spanning attribute prediction, descriptive understanding, and extended reasoning. Empirically, our method delivers consistent gains across diverse open-source LLMs and GPT-4, achieving up to 17.2% ROUGE-L improvement (average +7%) and even surpassing fine-tuned protein-specific language models. These results highlight that generic LLMs, when guided with protein-as-language cues, can outperform domain-specialized models, offering a scalable pathway for protein understanding in foundation models.
Problem

Research questions and friction points this paper is trying to address.

Deciphering functions of unseen protein sequences without task-specific training
Reformulating amino-acid sequences as interpretable language for LLMs
Enabling zero-shot protein function prediction through contextual exemplars
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reformulates amino-acid sequences as interpretable symbolic language
Adaptively constructs sequence-question-answer triples without training
Uses bilingual protein-QA corpus for zero-shot functional understanding
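The core mechanism above, presenting a protein sequence alongside a few answered exemplars so the LLM can infer the answer in context, can be sketched as follows. This is a minimal illustration with hypothetical helper names and toy data, not the paper's actual pipeline:

```python
def format_exemplar(sequence: str, question: str, answer: str) -> str:
    """Render one sequence-question-answer triple as a prompt block."""
    return f"Protein: {sequence}\nQ: {question}\nA: {answer}"


def build_prompt(exemplars, query_sequence: str, query_question: str) -> str:
    """Concatenate answered exemplar triples, then append the open query.

    The LLM is expected to complete the final "A:" in context,
    with no parameter updates (pure in-context learning).
    """
    blocks = [format_exemplar(*ex) for ex in exemplars]
    blocks.append(f"Protein: {query_sequence}\nQ: {query_question}\nA:")
    return "\n\n".join(blocks)


# Toy exemplars (illustrative only, not from the paper's corpus).
exemplars = [
    ("MKTAYIAKQR", "What is the likely subcellular location?", "Cytoplasm."),
    ("MSLNFLDFEQ", "Does this protein bind DNA?", "Yes."),
]
prompt = build_prompt(
    exemplars, "MAHHHHHHSS", "What family does this protein belong to?"
)
```

The resulting string would then be sent to any general-purpose LLM; no fine-tuning or adapter is involved, matching the zero-shot setting the paper describes.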
Xinhui Chen
Wuhan University
Zuchao Li
Wuhan University
Natural Language Processing, Machine Learning
Mengqi Gao
Wuhan University
Yufeng Zhang
Wuhan University
Chak Tou Leong
Hong Kong Polytechnic University
Haoyang Li
Stanford University
Jiaqi Chen
Stanford University, Topify AI