ProteinGPT: Multimodal LLM for Protein Property Prediction and Structure Understanding

📅 2024-08-21
🏛️ arXiv.org
📈 Citations: 8
Influential: 0
🤖 AI Summary
Traditional protein structure and function analysis pipelines are complex and time-consuming, hindering mechanistic biological discovery, drug development, and protein engineering. To address this, the authors introduce ProteinGPT, a multimodal large language model for proteins that jointly processes amino acid sequences and 3D structures to unify protein property prediction, structural understanding, and natural language question answering. ProteinGPT couples protein sequence and structure encoders with linear projection layers that adapt their representations into the embedding space of a large language model, which is then instruction-tuned end to end. Training relies on a large-scale dataset of 132,092 proteins, each annotated with 20-30 property tags and 5-10 QA pairs, with the instruction-tuning process optimized using GPT-4o. On protein-related question answering, ProteinGPT scores highly on both semantic and lexical metrics and significantly outperforms baseline models and general-purpose LLMs. Code and data are publicly available.

📝 Abstract
Understanding biological processes, drug development, and biotechnological advancements requires a detailed analysis of protein structures and functions, a task that is inherently complex and time-consuming in traditional protein research. To streamline this process, we introduce ProteinGPT, a state-of-the-art multimodal large language model for proteins that enables users to upload protein sequences and/or structures for comprehensive analysis and responsive inquiries. ProteinGPT integrates protein sequence and structure encoders with linear projection layers to ensure precise representation adaptation and leverages a large language model (LLM) to generate accurate, contextually relevant responses. To train ProteinGPT, we constructed a large-scale dataset of 132,092 proteins, each annotated with 20-30 property tags and 5-10 QA pairs per protein, and optimized the instruction-tuning process using GPT-4o. Experiments demonstrate that ProteinGPT effectively generates informative responses to protein-related questions, achieving high performance on both semantic and lexical metrics and significantly outperforming baseline models and general-purpose LLMs in understanding and responding to protein-related queries. Our code and data are available at https://github.com/ProteinGPT/ProteinGPT.
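The abstract's core architectural idea, encoder outputs adapted through linear projection layers into the LLM's embedding space, can be sketched as below. This is a minimal illustrative sketch: the dimensions, the use of per-residue embeddings, and the concatenation order of protein and text tokens are assumptions for demonstration, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; in practice these are fixed by the chosen
# sequence encoder, structure encoder, and LLM backbone.
D_SEQ, D_STRUCT, D_LLM = 1280, 512, 4096
L = 200   # protein length in residues
T = 32    # text prompt length in tokens

# Stand-ins for frozen encoder outputs: per-residue embeddings from the
# sequence encoder and the structure encoder.
seq_emb = rng.standard_normal((L, D_SEQ))
struct_emb = rng.standard_normal((L, D_STRUCT))

# The trainable linear projection layers described in the abstract: each
# maps one encoder's representation space into the LLM token-embedding space.
W_seq = rng.standard_normal((D_SEQ, D_LLM)) * 0.01
W_struct = rng.standard_normal((D_STRUCT, D_LLM)) * 0.01

seq_tokens = seq_emb @ W_seq            # (L, D_LLM)
struct_tokens = struct_emb @ W_struct   # (L, D_LLM)

# Prepend the projected protein tokens to the embedded text prompt so the
# LLM attends over protein and language tokens in a single input sequence.
prompt_emb = rng.standard_normal((T, D_LLM))
llm_input = np.concatenate([seq_tokens, struct_tokens, prompt_emb], axis=0)

print(llm_input.shape)  # (2*L + T, D_LLM) -> (432, 4096)
```

During instruction tuning, only the projection matrices (and optionally the LLM) would receive gradients, while the protein encoders stay frozen; that design choice is what makes the adaptation lightweight.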
Problem

Research questions and friction points this paper is trying to address.

Predicting protein properties from both sequences and structures
Efficiently understanding and analyzing complex protein data
Generating accurate, contextually relevant responses to protein-related inquiries
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal LLM integrates sequence and structure encoders
Large-scale dataset with annotated property tags
Instruction-tuning optimized using GPT-4o