A PLMs based protein retrieval framework

📅 2024-07-16
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing protein retrieval methods (e.g., BLAST) rely heavily on sequence similarity, which limits their ability to identify evolutionarily distant proteins with conserved structure or function. This work introduces a dense retrieval framework for proteins based on protein language models (PLMs), using ESM and ProtBERT to generate embeddings that reduce the bias toward sequence similarity. The approach pairs these embeddings with efficient approximate nearest neighbor indexing (e.g., FAISS) to enable scalable vector-based retrieval. Its core contribution, presented as the first of its kind, is applying PLM-derived embeddings to functional retrieval across dissimilar sequences, with matching at both the structural and functional levels. Evaluated on multiple benchmarks, the method improves recall of remote homologs and functionally similar proteins and recovers functional proteins that BLAST misses. The authors position it as a new approach to protein functional annotation and target discovery.
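The recall figures mentioned in the summary are not defined on this page. As a minimal sketch, assuming the common recall@k definition (the share of known relevant proteins that appear among the top k results), the metric could be computed as follows. The function name `recall_at_k` and all protein IDs are hypothetical and are not taken from the paper.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant items that appear in the top-k retrieved list."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)


# Hypothetical ranked results for one query protein (IDs are made up).
ranked_ids = ["P1", "P7", "P3", "P9", "P4"]
# Hypothetical ground-truth set of homologous or functionally similar proteins.
true_ids = {"P3", "P4", "P8"}

print(recall_at_k(ranked_ids, true_ids, k=3))
print(recall_at_k(ranked_ids, true_ids, k=5))
```

With this definition, a retriever that surfaces remote homologs earlier in the ranking scores higher at small k, which is the kind of improvement the summary attributes to the method.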

📝 Abstract
Protein retrieval, which aims to untangle the relationships among sequences, structures, and functions, supports progress in biology. The Basic Local Alignment Search Tool (BLAST), a sequence-similarity-based algorithm, has demonstrated the value of this field. Existing protein retrieval tools, however, prioritize sequence similarity and may overlook proteins that are dissimilar in sequence but share homology or function. To tackle this problem, we propose a novel protein retrieval framework that mitigates the bias towards sequence similarity. Our framework is the first to harness protein language models (PLMs) to embed protein sequences in a high-dimensional feature space, which enhances the representation capacity for subsequent analysis. An accelerated, indexed vector database is then constructed for fast access to and retrieval of the dense vectors. Extensive experiments demonstrate that our framework retrieves similar and dissimilar proteins equally well. Moreover, the approach identifies proteins that conventional methods fail to uncover. This framework will assist protein mining and support the development of biology.
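The data flow described in the abstract (embed each sequence, store the vectors, retrieve the nearest ones for a query) can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: `toy_embed` is a hypothetical stand-in that uses normalized k-mer counts instead of a PLM, `VectorStore` is a hypothetical class that does an exact linear scan instead of using an accelerated index, and the sequences are made up. Because the toy embedding only reflects sequence composition, it shows the mechanics of vector retrieval and not the reduced similarity bias the paper reports.

```python
import math
from collections import Counter


def toy_embed(sequence, k=2):
    """Stand-in for a PLM encoder: L2-normalized k-mer counts as a sparse vector.

    A real system would instead derive the vector from a protein language model.
    """
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {kmer: c / norm for kmer, c in counts.items()}


def cosine(u, v):
    """Dot product of two unit-length sparse vectors, i.e. cosine similarity."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the smaller vector
    return sum(w * v.get(key, 0.0) for key, w in u.items())


class VectorStore:
    """Exact brute-force store; an approximate index would replace the scan."""

    def __init__(self):
        self.ids = []
        self.vectors = []

    def add(self, protein_id, sequence):
        self.ids.append(protein_id)
        self.vectors.append(toy_embed(sequence))

    def search(self, query_sequence, top_k=3):
        q = toy_embed(query_sequence)
        scored = [(cosine(q, v), pid) for pid, v in zip(self.ids, self.vectors)]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [(pid, score) for score, pid in scored[:top_k]]


# Made-up sequences, used only to exercise the code.
store = VectorStore()
store.add("protA", "MKTAYIAKQRQISFVKSHFSRQ")
store.add("protB", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
store.add("protC", "GSSGSSGSSGSSGSSG")

hits = store.search("MKTAYIAKQRQISFVKSHFSRQ", top_k=2)
print(hits)
```

In a setup closer to the one described, the embedding function would call a PLM and the linear scan would be replaced by an indexed vector database, while the `add` and `search` interface could stay the same.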
Problem

Research questions and friction points this paper is trying to address.

Protein Retrieval
Sequence Similarity
Functional Similarity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Protein Language Model
Digital Representation
Smart Database
Yuxuan Wu
Embry-Riddle Aeronautical University
Composite, Process design, Complex system modeling
Xiao Yi
The Chinese University of Hong Kong
Security
Yang Tan
East China University of Science and Technology, China, Shanghai
Huiqun Yu
East China University of Science and Technology, China, Shanghai
Guisheng Fan
East China University of Science and Technology, China, Shanghai