🤖 AI Summary
Existing protein–text question answering methods rely on static datasets rather than expert biological workflows, which limits their fine-grained understanding and leads to poor out-of-distribution (OOD) generalization. To address these limitations, this work proposes 2D-ProteinRAG, a novel framework that, for the first time, embeds standard bioinformatics pipelines such as BLAST into a retrieval-augmented generation (RAG) system. It introduces a two-dimensional filtering mechanism, comprising horizontal attribute alignment and vertical homology-based semantic denoising, to improve the accuracy and robustness of functional interpretations. Extensive evaluations show that 2D-ProteinRAG consistently outperforms both fine-tuned baselines and other RAG approaches on in-distribution and diverse biological OOD benchmarks, underscoring its practical utility in real-world scientific settings.
📝 Abstract
Protein-Text Question Answering (QA) is crucial for interpreting biological sequences through natural language. Combining Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG), which efficiently leverages biological databases and facilitates reasoning, is a promising approach to this task. However, constrained by the standard RAG pipeline, existing models often rely on curated, static datasets rather than expert-proven biological workflows; they lack fine-grained information processing and struggle to generalize to novel, out-of-distribution (OOD) proteins. To bridge this gap, we propose 2D-ProteinRAG, a novel framework that enables LLMs to operate within the gold-standard biological research workflow (BLAST). To further extract high-quality information from noisy retrieval contexts, we introduce a dual-dimensional (2D) filtering strategy that follows expert analytical paradigms. Horizontal Fine-grained Attribute Alignment uses a lightweight, intent-aware discriminative filter to prune irrelevant metadata and align database entries with specific user queries. Vertical Homology-based Semantic Denoising resolves functional contradictions and redundancy across multiple homologs via hierarchical clustering. Extensive evaluations on both in-distribution and diverse biological OOD benchmarks demonstrate that 2D-ProteinRAG consistently achieves state-of-the-art performance, outperforming fine-tuned baselines and other RAG methods. Our results validate the framework's robustness and scalability, providing a practical solution for interpreting protein function in real-world scientific scenarios.
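
To make the two filtering dimensions concrete, here is a minimal, illustrative sketch in Python. It is not the authors' implementation. The hit records, the `INTENT_FIELDS` map, the token-level Jaccard distance, and the identity-sum cluster scoring are all assumptions made for illustration. In particular, a hand-written keyword map stands in for the learned intent-aware discriminative filter, and a small single-linkage routine stands in for whatever hierarchical clustering the paper uses.

```python
from itertools import combinations

# Hypothetical BLAST-style hits: each homolog carries database metadata fields.
HITS = [
    {"id": "P1", "identity": 92.0, "function": "ATP binding kinase activity", "location": "cytoplasm"},
    {"id": "P2", "identity": 88.0, "function": "kinase activity ATP binding", "location": "cytoplasm"},
    {"id": "P3", "identity": 41.0, "function": "DNA binding transcription factor", "location": "nucleus"},
]

# Assumed intent-to-field map; a stand-in for a learned discriminative filter.
INTENT_FIELDS = {"function": ["function"], "location": ["location"]}


def horizontal_filter(hits, intent):
    """Horizontal step: keep only the metadata fields relevant to the query intent."""
    keep = INTENT_FIELDS[intent]
    return [{"id": h["id"], "identity": h["identity"], **{f: h[f] for f in keep}} for h in hits]


def jaccard_distance(a, b):
    """Jaccard distance between the lowercase token sets of two annotations."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return 1.0 - len(sa & sb) / len(sa | sb)


def single_linkage_clusters(texts, threshold):
    """Agglomerative clustering: merge the closest pair until it exceeds the threshold."""
    clusters = [[i] for i in range(len(texts))]
    while len(clusters) > 1:
        best = None
        for x, y in combinations(range(len(clusters)), 2):
            d = min(jaccard_distance(texts[i], texts[j]) for i in clusters[x] for j in clusters[y])
            if best is None or d < best[0]:
                best = (d, x, y)
        if best[0] > threshold:
            break
        _, x, y = best  # x < y, so deleting index y leaves index x valid
        clusters[x] += clusters[y]
        del clusters[y]
    return clusters


def vertical_denoise(hits, field, threshold=0.5):
    """Vertical step: cluster homolog annotations and keep the best-supported cluster."""
    clusters = single_linkage_clusters([h[field] for h in hits], threshold)
    # Score each cluster by summed sequence identity (an assumed heuristic).
    best = max(clusters, key=lambda c: sum(hits[i]["identity"] for i in c))
    rep = max(best, key=lambda i: hits[i]["identity"])
    return hits[rep], [hits[i]["id"] for i in best]


filtered = horizontal_filter(HITS, "function")
representative, support = vertical_denoise(filtered, "function")
print(representative["function"], support)
```

In this toy example the two near-duplicate kinase annotations collapse into one cluster, while the low-identity hit with a conflicting annotation stays separate and is dropped, so the context handed to the LLM would contain a single consistent function statement. A real system would presumably score clusters and compare annotations with richer signals such as alignment statistics or embedding similarity.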