Protriever: End-to-End Differentiable Protein Homology Search for Fitness Prediction

📅 2025-06-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Traditional homology search for proteins relies on computationally expensive multiple sequence alignments (MSAs), which struggle with highly divergent sequences and complex insertions/deletions (indels), and are decoupled from downstream modeling tasks. This work introduces the first end-to-end differentiable homology search framework that jointly optimizes sequence retrieval and protein fitness prediction. The method employs a Transformer encoder to learn sequence representations, leverages contrastive learning to construct a discriminative embedding space, and enables dynamic database adaptation via differentiable vector retrieval and efficient approximate nearest-neighbor search. Crucially, it eliminates MSA dependence entirely and is agnostic to both downstream tasks and model architectures. On protein fitness prediction, it achieves state-of-the-art performance, accelerates inference by two orders of magnitude, and significantly improves robustness to high sequence divergence and intricate indels.
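The summary mentions contrastive learning for shaping the embedding space but gives no formula. One common choice is an InfoNCE-style loss, sketched below under that assumption; the function names, the temperature value, and the use of cosine similarity are illustrative and are not taken from the paper.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss for one query embedding.

    Assumption (not from the paper): `positive` is the embedding of a known
    homolog, `negatives` are embeddings of unrelated sequences.
    The loss is low when the query is closer to the positive than to the negatives.
    """
    q = normalize(query)
    candidates = [normalize(positive)] + [normalize(n) for n in negatives]
    logits = [dot(q, c) / temperature for c in candidates]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    # Negative log-probability of picking the positive (index 0).
    return log_sum_exp - logits[0]
```

Minimizing a loss of this shape pulls homolog embeddings toward the query and pushes unrelated sequences away, which is what makes a later nearest-neighbor lookup meaningful.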

📝 Abstract

Retrieving homologous protein sequences is essential for a broad range of protein modeling tasks such as fitness prediction, protein design, structure modeling, and protein-protein interactions. Traditional workflows have relied on a two-step process: first retrieving homologs via multiple sequence alignments (MSAs), then training models on one or more of these alignments. However, MSA-based retrieval is computationally expensive, struggles with highly divergent sequences or complex insertion and deletion patterns, and operates independently of the downstream modeling objective. We introduce Protriever, an end-to-end differentiable framework that learns to retrieve relevant homologs while simultaneously training for the target task. When applied to protein fitness prediction, Protriever achieves state-of-the-art performance compared to sequence-based models that rely on MSA-based homolog retrieval, while being two orders of magnitude faster through efficient vector search. Protriever is both architecture- and task-agnostic, and can flexibly adapt to different retrieval strategies and protein databases at inference time, offering a scalable alternative to alignment-centric approaches.
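To make the idea of retrieval by vector search concrete, here is a minimal sketch of top-k retrieval over sequence embeddings. It is not Protriever's implementation: the exhaustive scan stands in for an approximate nearest-neighbor index, the toy embeddings would in practice come from a learned encoder, and the softmax weighting is one plausible way (an assumption here) to turn similarity scores into a smooth retrieval distribution through which a downstream loss could propagate gradients.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den


def retrieve(query_emb, index, k=2, temperature=1.0):
    """Return the k most similar entries with softmax-normalized weights.

    `index` maps a hypothetical sequence name to its embedding. A real system
    would replace this exhaustive scan with an approximate nearest-neighbor
    search over millions of sequences.
    """
    scored = sorted(
        ((cosine(query_emb, emb), name) for name, emb in index.items()),
        reverse=True,
    )[:k]
    # Softmax over the retained scores: the weights vary smoothly with the
    # similarities, so they can carry gradient back to the encoder.
    top = max(score for score, _ in scored)
    exps = [math.exp((score - top) / temperature) for score, _ in scored]
    total = sum(exps)
    return [(name, e / total) for (_, name), e in zip(scored, exps)]


# Toy two-dimensional embeddings; names and values are purely illustrative.
toy_index = {
    "homolog_a": [1.0, 0.1],
    "homolog_b": [0.8, 0.6],
    "unrelated": [-1.0, 0.0],
}
retrieved = retrieve([1.0, 0.0], toy_index, k=2)
```

The returned weights can be used to combine the downstream model's predictions across retrieved sequences, so that retrieval quality is scored by the task loss rather than by alignment statistics.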

Problem

Research questions and friction points this paper is trying to address.

Retrieve homologous protein sequences efficiently

Overcome limitations of MSA-based retrieval methods

Enable end-to-end differentiable protein fitness prediction

Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end differentiable protein homology search

Efficient vector search for faster performance

Architecture- and task-agnostic flexible framework

🔎 Similar Papers

A PLMs based protein retrieval framework