Few-shot Protein Fitness Prediction via In-context Learning and Test-time Training

📅 2025-12-02
🤖 AI Summary
Protein fitness prediction remains challenging under few-shot conditions. This paper introduces PRIMO, the first framework unifying in-context learning, test-time training, and preference learning within a single Transformer architecture. PRIMO is pretrained with masked language modeling and jointly encodes protein sequences, zero-shot predictions, and sparse experimental labels; it is optimized with a preference-based loss function, eliminating the need for large task-specific labeled datasets. The framework enables rapid cross-family and cross-mutation-type adaptation. Extensive evaluation across multiple protein families and functional phenotypes shows that PRIMO significantly outperforms both zero-shot and fully supervised baselines. These results validate the effectiveness and generalizability of the "pretraining + test-time adaptation" paradigm in low-data regimes.

📝 Abstract
Accurately predicting protein fitness with minimal experimental data is a persistent challenge in protein engineering. We introduce PRIMO (PRotein In-context Mutation Oracle), a transformer-based framework that leverages in-context learning and test-time training to adapt rapidly to new proteins and assays without large task-specific datasets. By encoding sequence information, auxiliary zero-shot predictions, and sparse experimental labels from many assays as a unified token set in a masked-language-modeling pre-training paradigm, PRIMO learns to prioritize promising variants through a preference-based loss function. Across diverse protein families and properties, including both substitution and indel mutations, PRIMO outperforms zero-shot and fully supervised baselines. This work underscores the power of combining large-scale pre-training with efficient test-time adaptation to tackle challenging protein design tasks where data collection is expensive and label availability is limited.
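The "unified token set" idea in the abstract can be illustrated with a minimal sketch. The token names, score binning, and layout below are illustrative assumptions, not PRIMO's actual vocabulary: each context variant contributes its amino-acid tokens, a discretized zero-shot-score token, and a label token; the query variant ends with a masked label token for the model to fill in.

```python
# Sketch: encoding a few-shot fitness task as one token sequence.
# Token names and binning here are assumptions, not PRIMO's real scheme.

def discretize(score, n_bins=5, lo=-1.0, hi=1.0):
    """Map a continuous score into one of n_bins bucket ids."""
    frac = (min(max(score, lo), hi) - lo) / (hi - lo)
    return min(int(frac * n_bins), n_bins - 1)

def encode_example(sequence, zero_shot_score, label=None):
    """One variant: amino acids, a zero-shot bin token, then a label or <mask>."""
    tokens = list(sequence)
    tokens.append(f"<zs_{discretize(zero_shot_score)}>")
    tokens.append(f"<y_{discretize(label)}>" if label is not None else "<mask>")
    return tokens

def encode_task(context, query_seq, query_zs):
    """Concatenate labeled context variants, then the masked query variant."""
    tokens = []
    for seq, zs, y in context:
        tokens += encode_example(seq, zs, y) + ["<sep>"]
    tokens += encode_example(query_seq, query_zs)  # query label is masked
    return tokens
```

For example, `encode_task([("MKV", 0.2, 0.8)], "MRV", -0.3)` yields the labeled context variant, a separator, and a query ending in `<mask>` — the position whose prediction ranks the variant.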
Problem

Research questions and friction points this paper is trying to address.

Predicting protein fitness from minimal experimental data
Adapting to new proteins and assays without large task-specific labeled datasets
Generalizing across diverse protein families and mutation types
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based framework with in-context learning
Unified token encoding of sequences, zero-shot predictions, and sparse labels
Preference-based loss for prioritizing protein variants
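The preference-based loss in the last bullet can be sketched as a pairwise logistic ranking (Bradley–Terry style) objective. This form is an assumption for illustration — the paper's exact loss is not specified here — but it captures the stated goal: penalize the model when it scores a lower-fitness variant above a higher-fitness one.

```python
import math

def preference_loss(score_better, score_worse):
    """Pairwise logistic loss: small when the better variant outscores
    the worse one, large when the ordering is inverted."""
    margin = score_better - score_worse
    return math.log1p(math.exp(-margin))

def batch_preference_loss(scores, labels):
    """Average pairwise loss over all ordered pairs with distinct labels."""
    pairs = [(i, j)
             for i in range(len(labels))
             for j in range(len(labels))
             if labels[i] > labels[j]]
    if not pairs:
        return 0.0
    return sum(preference_loss(scores[i], scores[j]) for i, j in pairs) / len(pairs)
```

A ranking objective like this only constrains the relative ordering of variants, which suits few-shot assays where absolute fitness scales vary from experiment to experiment.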