🤖 AI Summary
Protein function understanding faces several challenges: rigid classification paradigms, underutilization of 3D structural information, and the absence of systematic QA evaluation benchmarks. To address these, we propose the first multimodal large language model (MLLM) for protein understanding, which unifies amino acid sequence and 3D structure modeling via early-fusion sequence-structure encoding and a protein-text cross-modal adapter, enabling natural language question answering. The model integrates an enhanced ProteinMPNN encoder, a cross-attention adapter, and a LLaMA3 decoder, and is trained with LoRA fine-tuning while the encoder remains frozen. On two benchmark datasets, it achieves statistically significant improvements over baselines in both automated metrics and expert evaluations. Notably, it demonstrates zero-shot conversational generalization, a first in protein science, and exhibits strong robustness and interpretability on structure-aware QA tasks.
📝 Abstract
Proteins play a pivotal role in living organisms, yet understanding their functions presents significant challenges, including the limited flexibility of classification-based methods, the inability to effectively leverage spatial structural information, and the lack of systematic evaluation metrics for protein Q&A systems. To address these limitations, we propose Prot2Chat, a novel framework that integrates multimodal protein representations with natural language through a unified module, enabling large language model (LLM)-driven answer generation. Our model consists of three components: a modified ProteinMPNN encoder that encodes protein sequence and structural information in a unified manner, a protein-text adapter with cross-attention mechanisms, and a LLaMA3 decoder. To optimize training efficiency, we freeze the encoder and apply LoRA fine-tuning to the decoder. We conducted experiments on two datasets; both automated metrics and expert evaluations demonstrate the superior performance of our model. Furthermore, zero-shot prediction results highlight its strong generalization capabilities. This framework offers a promising solution for bridging protein domain knowledge with natural language understanding, paving the way for transformative advancements in protein-related research.
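The architecture described above (frozen structure-aware encoder, cross-attention protein-text adapter, LoRA-tuned decoder) can be sketched in PyTorch. All class names, dimensions, and hyperparameters below are illustrative assumptions, not the paper's actual implementation:

```python
import torch
import torch.nn as nn

class ProteinTextAdapter(nn.Module):
    """Hypothetical sketch of the protein-text adapter: text token
    embeddings (queries) attend over protein token embeddings (keys/values)
    produced by a frozen structure-aware encoder."""
    def __init__(self, protein_dim: int, text_dim: int, n_heads: int = 8):
        super().__init__()
        # Project encoder outputs into the LLM's embedding space
        self.proj = nn.Linear(protein_dim, text_dim)
        self.attn = nn.MultiheadAttention(text_dim, n_heads, batch_first=True)

    def forward(self, text_emb: torch.Tensor, protein_emb: torch.Tensor) -> torch.Tensor:
        kv = self.proj(protein_emb)
        fused, _ = self.attn(text_emb, kv, kv)  # cross-attention fusion
        return fused

class LoRALinear(nn.Module):
    """Minimal LoRA layer: a frozen base linear plus a trainable
    low-rank update, scaled by alpha / r (illustrative, not peft)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # only the low-rank factors train
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no-op at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

# Usage with assumed dimensions: 16 text tokens (dim 512), 40 protein tokens (dim 128)
adapter = ProteinTextAdapter(protein_dim=128, text_dim=512)
fused = adapter(torch.randn(1, 16, 512), torch.randn(1, 40, 128))  # shape (1, 16, 512)
```

The design choice mirrored here is that only the adapter and the decoder's LoRA factors receive gradients, which keeps trainable parameters small relative to full fine-tuning.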