🤖 AI Summary
This work addresses the zero-shot interpretation of unknown protein functions without task-specific adapters or supervised fine-tuning. We propose the "Protein-as-Second-Language" framework, which maps amino-acid sequences into symbolic representations that large language models (LLMs) can interpret, and we construct a bilingual protein question-answering corpus of 79,926 sequence–question–answer triples covering attribute prediction, descriptive understanding, and extended reasoning. Using only in-context learning and prompt engineering, with no parameter updates, the method generalizes across models: on GPT-4 and several open-weight LLMs, ROUGE-L improves by 7.0% on average (up to 17.2%), surpassing specialized protein models. This is the first demonstration that general-purpose LLMs, driven purely by prompting, can match or exceed supervised fine-tuned domain-specific models on protein functional understanding tasks, establishing a new paradigm for biological language modeling.
📝 Abstract
Deciphering the function of unseen protein sequences is a fundamental challenge with broad scientific impact, yet most existing methods depend on task-specific adapters or large-scale supervised fine-tuning. We introduce the "Protein-as-Second-Language" framework, which reformulates amino-acid sequences as sentences in a novel symbolic language that large language models can interpret through contextual exemplars. Our approach adaptively constructs sequence–question–answer triples that reveal functional cues in a zero-shot setting, without any further training. To support this process, we curate a bilingual corpus of 79,926 protein QA instances spanning attribute prediction, descriptive understanding, and extended reasoning. Empirically, our method delivers consistent gains across diverse open-source LLMs and GPT-4, achieving up to a 17.2% ROUGE-L improvement (+7.0% on average) and even surpassing fine-tuned protein-specific language models. These results highlight that generic LLMs, when guided with protein-as-language cues, can outperform domain-specialized models, offering a scalable pathway for protein understanding in foundation models.
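The in-context setup described above (exemplar triples prepended to a query, with no parameter updates) can be sketched as follows. This is a minimal illustration, not the paper's actual prompt template or corpus: the exemplar sequences, questions, and answers below are hypothetical placeholders.

```python
# Hedged sketch of exemplar-based protein QA prompting.
# Assumption: each in-context exemplar is a (sequence, question, answer)
# triple, and the LLM completes the answer for the final query.
from typing import List, Tuple


def build_protein_qa_prompt(
    exemplars: List[Tuple[str, str, str]],
    query_sequence: str,
    query_question: str,
) -> str:
    """Concatenate exemplar triples with the query; zero fine-tuning involved."""
    parts = []
    for seq, question, answer in exemplars:
        parts.append(f"Sequence: {seq}\nQuestion: {question}\nAnswer: {answer}\n")
    # The query ends at "Answer:" so the model fills in the prediction.
    parts.append(f"Sequence: {query_sequence}\nQuestion: {query_question}\nAnswer:")
    return "\n".join(parts)


# Illustrative placeholder exemplars (not from the paper's corpus).
exemplars = [
    ("MKTAYIAKQR", "What is the subcellular localization?", "Cytoplasm."),
    ("MASNTVSAQG", "Does this protein bind ATP?", "Yes, via a P-loop motif."),
]
prompt = build_protein_qa_prompt(
    exemplars, "MGSSHHHHHH", "What family does this protein belong to?"
)
print(prompt.count("Sequence:"))  # → 3 (two exemplars plus the query)
```

The resulting string would be sent to any chat or completion LLM as-is, which is what allows the same prompt to be evaluated across GPT-4 and open-weight models without per-model adaptation.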