🤖 AI Summary
Problem: Protein large language models (Protein LLMs) still lack a systematic, comprehensive survey. Existing surveys focus on specific aspects or applications, which hinders a holistic understanding of the models' design principles, evaluation practices, and application scope.
Method: We introduce the first unified taxonomy of Protein LLMs, covering architectural design, training data curation, evaluation metrics, and application domains, and synthesizing over 100 studies. Our analysis centers on four key paradigms: self-supervised pretraining, multi-task fine-tuning, cross-modal alignment, and interpretability analysis.
Contribution/Results: We propose the first end-to-end, structured framework for Protein LLMs. We establish an open-source knowledge base and a dynamically updated resource hub (hosted on GitHub) that offers methodological guidelines and benchmarking protocols for protein structure prediction, functional annotation, and engineering design. We also clarify the foundational role of Protein LLMs as enabling tools that accelerate discovery in protein science. This work bridges gaps between theory, implementation, and real-world deployment, and it identifies persistent challenges, including data scarcity, evaluation inconsistency, and limited generalizability across biological contexts.
📝 Abstract
Protein-specific large language models (Protein LLMs) are revolutionizing protein science by enabling more efficient protein structure prediction, function annotation, and design. While existing surveys focus on specific aspects or applications, this work provides the first comprehensive overview of Protein LLMs, covering their architectures, training datasets, evaluation metrics, and diverse applications. Through a systematic analysis of over 100 articles, we propose a structured taxonomy of state-of-the-art Protein LLMs, analyze how they leverage large-scale protein sequence data for improved accuracy, and explore their potential in advancing protein engineering and biomedical research. Additionally, we discuss key challenges and future directions, positioning Protein LLMs as essential tools for scientific discovery in protein science. Resources are maintained at https://github.com/Yijia-Xiao/Protein-LLM-Survey.