🤖 AI Summary
This work addresses the zero-shot interpretation of unknown protein functions without task-specific adapters or supervised fine-tuning. We propose the "Protein-as-Second-Language" framework, which maps amino-acid sequences into symbolic representations that large language models (LLMs) can interpret, and we construct a bilingual protein question-answering corpus of 79,926 sequence–question–answer triples covering attribute prediction, descriptive understanding, and extended reasoning. Using only in-context learning and prompt engineering, with no parameter updates, the method generalizes across models: on GPT-4 and several open-weight LLMs, ROUGE-L improves by 7.0% on average (up to 17.2%), surpassing specialized protein models. This is the first demonstration that general-purpose LLMs, driven purely by prompting, can match or exceed supervised fine-tuned domain-specific models on protein functional understanding tasks, establishing a new paradigm for biological language modeling.
📝 Abstract
Deciphering the function of unseen protein sequences is a fundamental challenge with broad scientific impact, yet most existing methods depend on task-specific adapters or large-scale supervised fine-tuning. We introduce the "Protein-as-Second-Language" framework, which reformulates amino-acid sequences as sentences in a novel symbolic language that large language models can interpret through contextual exemplars. Our approach adaptively constructs sequence–question–answer triples that reveal functional cues in a zero-shot setting, without any further training. To support this process, we curate a bilingual corpus of 79,926 protein QA instances spanning attribute prediction, descriptive understanding, and extended reasoning. Empirically, our method delivers consistent gains across diverse open-source LLMs and GPT-4, achieving up to a 17.2% ROUGE-L improvement (+7.0% on average) and even surpassing fine-tuned protein-specific language models. These results highlight that generic LLMs, when guided with protein-as-language cues, can outperform domain-specialized models, offering a scalable pathway for protein understanding in foundation models.
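The in-context setup described above (exemplar triples prepended to a query, with no parameter updates) can be sketched as follows. This is a minimal illustration, not the paper's actual prompt template or corpus: the exemplar sequences, questions, and answers below are hypothetical placeholders.

```python
# Hedged sketch of exemplar-based protein QA prompting.
# Assumption: each in-context exemplar is a (sequence, question, answer)
# triple, and the LLM completes the answer for the final query.
from typing import List, Tuple


def build_protein_qa_prompt(
    exemplars: List[Tuple[str, str, str]],
    query_sequence: str,
    query_question: str,
) -> str:
    """Concatenate exemplar triples with the query; zero fine-tuning involved."""
    parts = []
    for seq, question, answer in exemplars:
        parts.append(f"Sequence: {seq}\nQuestion: {question}\nAnswer: {answer}\n")
    # The query ends at "Answer:" so the model fills in the prediction.
    parts.append(f"Sequence: {query_sequence}\nQuestion: {query_question}\nAnswer:")
    return "\n".join(parts)


# Illustrative placeholder exemplars (not from the paper's corpus).
exemplars = [
    ("MKTAYIAKQR", "What is the subcellular localization?", "Cytoplasm."),
    ("MASNTVSAQG", "Does this protein bind ATP?", "Yes, via a P-loop motif."),
]
prompt = build_protein_qa_prompt(
    exemplars, "MGSSHHHHHH", "What family does this protein belong to?"
)
print(prompt.count("Sequence:"))  # → 3 (two exemplars plus the query)
```

The resulting string would be sent to any chat or completion LLM as-is, which is what allows the same prompt to be evaluated across GPT-4 and open-weight models without per-model adaptation.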