Few-shot Protein Fitness Prediction via In-context Learning and Test-time Training

📅 2025-12-02
🤖 AI Summary
Protein fitness prediction remains challenging under few-shot conditions. This paper introduces PRIMO, the first framework unifying in-context learning, test-time training, and preference learning within a single Transformer architecture. PRIMO is pretrained with masked language modeling and jointly encodes protein sequences, zero-shot predictions, and sparse experimental labels; it is optimized with a preference-based loss function, eliminating the need for large task-specific labeled datasets. The framework enables rapid cross-family and cross-mutation-type adaptation. Extensive evaluation across multiple protein families and functional phenotypes shows that PRIMO significantly outperforms both zero-shot and fully supervised baselines. These results validate the effectiveness and generalizability of the "pretraining + test-time adaptation" paradigm in low-data regimes.

📝 Abstract
Accurately predicting protein fitness with minimal experimental data is a persistent challenge in protein engineering. We introduce PRIMO (PRotein In-context Mutation Oracle), a transformer-based framework that leverages in-context learning and test-time training to adapt rapidly to new proteins and assays without large task-specific datasets. By encoding sequence information, auxiliary zero-shot predictions, and sparse experimental labels from many assays as a unified token set in a masked-language-modeling pre-training paradigm, PRIMO learns to prioritize promising variants through a preference-based loss function. Across diverse protein families and properties, including both substitution and indel mutations, PRIMO outperforms zero-shot and fully supervised baselines. This work underscores the power of combining large-scale pre-training with efficient test-time adaptation to tackle challenging protein design tasks where data collection is expensive and label availability is limited.
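The "unified token set" idea in the abstract can be illustrated with a minimal sketch. The token names, score binning, and layout below are illustrative assumptions, not PRIMO's actual vocabulary: each context variant contributes its amino-acid tokens, a discretized zero-shot-score token, and a label token; the query variant ends with a masked label token for the model to fill in.

```python
# Sketch: encoding a few-shot fitness task as one token sequence.
# Token names and binning here are assumptions, not PRIMO's real scheme.

def discretize(score, n_bins=5, lo=-1.0, hi=1.0):
    """Map a continuous score into one of n_bins bucket ids."""
    frac = (min(max(score, lo), hi) - lo) / (hi - lo)
    return min(int(frac * n_bins), n_bins - 1)

def encode_example(sequence, zero_shot_score, label=None):
    """One variant: amino acids, a zero-shot bin token, then a label or <mask>."""
    tokens = list(sequence)
    tokens.append(f"<zs_{discretize(zero_shot_score)}>")
    tokens.append(f"<y_{discretize(label)}>" if label is not None else "<mask>")
    return tokens

def encode_task(context, query_seq, query_zs):
    """Concatenate labeled context variants, then the masked query variant."""
    tokens = []
    for seq, zs, y in context:
        tokens += encode_example(seq, zs, y) + ["<sep>"]
    tokens += encode_example(query_seq, query_zs)  # query label is masked
    return tokens
```

For example, `encode_task([("MKV", 0.2, 0.8)], "MRV", -0.3)` yields the labeled context variant, a separator, and a query ending in `<mask>` — the position whose prediction ranks the variant.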
Problem

Research questions and friction points this paper is trying to address.

Predicting protein fitness from minimal experimental data
Adapting to new proteins and assays without large task-specific labeled datasets
Generalizing across diverse protein families and mutation types
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based framework with in-context learning
Unified token encoding of sequences, zero-shot predictions, and sparse labels
Preference-based loss for prioritizing protein variants
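The preference-based loss in the last bullet can be sketched as a pairwise logistic ranking (Bradley–Terry style) objective. This form is an assumption for illustration — the paper's exact loss is not specified here — but it captures the stated goal: penalize the model when it scores a lower-fitness variant above a higher-fitness one.

```python
import math

def preference_loss(score_better, score_worse):
    """Pairwise logistic loss: small when the better variant outscores
    the worse one, large when the ordering is inverted."""
    margin = score_better - score_worse
    return math.log1p(math.exp(-margin))

def batch_preference_loss(scores, labels):
    """Average pairwise loss over all ordered pairs with distinct labels."""
    pairs = [(i, j)
             for i in range(len(labels))
             for j in range(len(labels))
             if labels[i] > labels[j]]
    if not pairs:
        return 0.0
    return sum(preference_loss(scores[i], scores[j]) for i, j in pairs) / len(pairs)
```

A ranking objective like this only constrains the relative ordering of variants, which suits few-shot assays where absolute fitness scales vary from experiment to experiment.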