Structure-Enhanced Protein Instruction Tuning: Towards General-Purpose Protein Understanding

📅 2024-10-04
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This study addresses the challenge of general-purpose protein understanding—overcoming the task-specific limitations of existing protein language models (pLMs) to enable unified multimodal modeling of protein structure, function, and physicochemical properties. To this end, we propose the first structure-enhanced instruction-tuning framework, comprising: (1) a structure-aware embedding module that integrates 3D geometric priors; (2) a two-stage warm-up strategy of contrastive pretraining followed by structural-denoising fine-tuning; (3) a caption-guided initialization mechanism; and (4) an MoE-driven collaborative refinement paradigm. We also release Protein-Instruction-1M, the largest publicly available protein instruction dataset to date (1 million samples). Our method achieves state-of-the-art performance on both open-ended generation and closed-book question answering, consistently outperforming proprietary general-purpose LLMs and leading open-source protein-augmented LLMs, with significant gains in functional annotation, physicochemical property prediction, and multi-step biological reasoning.
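The structural-denoising stage described above perturbs 3D coordinates and trains the model to undo the perturbation. A toy sketch of such an objective, assuming Gaussian coordinate noise and an MSE target on the predicted perturbation (the function names and exact formulation here are hypothetical, not the paper's implementation):

```python
import random

def perturb(coords, sigma=0.1, rng=None):
    """Add independent Gaussian noise to each 3D coordinate.

    coords: list of [x, y, z] atom positions.
    Returns (noisy_coords, noise) so the noise can serve as the
    regression target for a denoising model.
    """
    rng = rng or random.Random(0)
    noise = [[rng.gauss(0.0, sigma) for _ in p] for p in coords]
    noisy = [[c + n for c, n in zip(p, q)] for p, q in zip(coords, noise)]
    return noisy, noise

def denoising_loss(predicted_noise, true_noise):
    """MSE between the model's predicted perturbation and the true one."""
    flat_p = [v for p in predicted_noise for v in p]
    flat_t = [v for p in true_noise for v in p]
    return sum((a - b) ** 2 for a, b in zip(flat_p, flat_t)) / len(flat_t)
```

A model that exactly recovers the injected noise drives this loss to zero; predicting no perturbation at all leaves a positive residual.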

📝 Abstract
Proteins, as essential biomolecules, play a central role in biological processes, including metabolic reactions and DNA replication. Accurate prediction of their properties and functions is crucial for biological applications. The recent development of protein language models (pLMs) with supervised fine-tuning offers a promising solution, but a fine-tuned model is tailored to a particular downstream prediction task, and achieving general-purpose protein understanding remains a challenge. In this paper, we introduce the Structure-Enhanced Protein Instruction Tuning (SEPIT) framework to bridge this gap. Our approach incorporates a novel structure-aware module into pLMs to enrich their structural knowledge, and then integrates these enhanced pLMs with large language models (LLMs) to advance protein understanding. Within this framework, we propose a novel instruction-tuning pipeline: first, we warm up the enhanced pLMs with contrastive learning and structure denoising; next, caption-based instructions establish a basic understanding of proteins; finally, a mixture of experts (MoEs) refines this understanding to capture more complex properties and functional information while keeping the same number of activated parameters. Moreover, we construct the largest and most comprehensive protein instruction dataset to date, which allows us to train and evaluate a general-purpose protein understanding model. Extensive experiments on both open-ended generation and closed-set answer tasks demonstrate the superior performance of SEPIT over both closed-source general LLMs and open-source LLMs trained with protein knowledge.
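The contrastive half of the warm-up can be read as aligning each protein's sequence embedding with its own structure embedding against in-batch negatives. A minimal InfoNCE-style sketch in plain Python (the toy embeddings and temperature value are illustrative assumptions; the paper's encoders and loss details are not reproduced here):

```python
import math

def info_nce(seq_embs, struct_embs, temperature=0.1):
    """InfoNCE: each sequence embedding should score highest against
    its paired structure embedding (the diagonal of the similarity matrix)."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    def cos(u, v):
        return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

    losses = []
    for i, s in enumerate(seq_embs):
        logits = [cos(s, t) / temperature for t in struct_embs]
        # numerically stable log-sum-exp for the softmax normalizer
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[i])  # -log p(true pair)
    return sum(losses) / len(losses)
```

Correctly paired embeddings yield a near-zero loss, while shuffled pairings are penalized, which is the alignment pressure the warm-up stage relies on.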
Problem

Research questions and friction points this paper is trying to address.

Enhancing protein language models with structural knowledge
Achieving general-purpose protein understanding via instruction tuning
Leveraging large datasets to improve protein property prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Incorporates structure-aware module into pLMs
Uses contrastive learning and structure denoising
Employs mixture of experts for complex properties
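The MoE point above hinges on sparse activation: routing each input to only the top-k experts keeps the number of activated parameters fixed while total capacity grows. A minimal top-k routing sketch (function names and shapes are hypothetical, not the paper's implementation):

```python
import math

def top_k_route(gate_logits, k=2):
    """Pick the top-k experts by gate logit and renormalize their
    weights with a softmax, so only k experts are activated."""
    topk = sorted(range(len(gate_logits)),
                  key=lambda i: gate_logits[i], reverse=True)[:k]
    m = max(gate_logits[i] for i in topk)
    exps = {i: math.exp(gate_logits[i] - m) for i in topk}
    z = sum(exps.values())
    return {i: exps[i] / z for i in topk}

def moe_forward(x, experts, gate_logits, k=2):
    """Weighted sum over only the selected experts' outputs."""
    weights = top_k_route(gate_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())
```

With k fixed, adding more experts enlarges the model without changing per-input compute, which matches the "same number of activated parameters" claim.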
Authors

Wei Wu, School of Artificial Intelligence and Data Science, University of Science and Technology of China
Chao Wang, School of Artificial Intelligence and Data Science, University of Science and Technology of China
Liyi Chen, PhD at PolyU, HK
Mingze Yin, Zhejiang University (Deep Learning, AI for Science, Computer Vision)
Yiheng Zhu, Zhongguancun Academy & Zhongguancun Institute of Artificial Intelligence (AI for Science, Deep generative models, Protein design, Drug discovery)
Kun Fu, Alibaba Cloud Computing
Jieping Ye, Alibaba Cloud Computing
Hui Xiong, Senior Scientist, Candela Corporation (Ultrafast dynamics, atomic molecular physics, free electron laser)
Zheng Wang, Alibaba Cloud Computing