🤖 AI Summary
Predicting protein fitness remains challenging under few-shot conditions. This paper introduces PRIMO, the first framework to unify in-context learning, test-time training, and preference learning within a Transformer architecture. PRIMO is pretrained with masked language modeling and jointly encodes protein sequences, zero-shot predictions, and sparse experimental labels; it is optimized via a preference-based loss function, eliminating the need for large task-specific labeled datasets. The framework enables rapid cross-family and cross-mutation-type adaptation. Extensive evaluation across multiple protein families and functional phenotypes demonstrates that PRIMO significantly outperforms both zero-shot and fully supervised baselines. These results validate the effectiveness and generalizability of the “pretraining + test-time adaptation” paradigm in low-data regimes.
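The summary describes packing sequences, zero-shot predictions, and sparse labels into one token stream for masked modeling. The sketch below is only an illustration of that idea, not the authors' code: the vocabulary sizes, bin counts, model dimensions, and the `UnifiedEncoder`/`pack_variant` names are all assumptions made for the example.

```python
# Minimal sketch (illustrative, not PRIMO's implementation): pack a protein
# sequence, a discretized zero-shot score, and a sparse experimental label
# into one token sequence; unlabeled variants get a [MASK] label token that
# the model learns to fill, in a masked-language-modeling style.
import torch
import torch.nn as nn

AA_VOCAB = 25            # amino acids + special tokens (assumed size)
ZS_BINS = 10             # zero-shot score bins (assumed)
LABEL_BINS = 10          # experimental fitness-label bins (assumed)
MASK_ID = AA_VOCAB + ZS_BINS + LABEL_BINS   # shared [MASK] token id
VOCAB = MASK_ID + 1
D_MODEL = 128

class UnifiedEncoder(nn.Module):
    """Tiny Transformer over the concatenated (sequence | zero-shot | label) tokens."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, tokens):
        return self.head(self.encoder(self.embed(tokens)))

def pack_variant(seq_ids, zs_bin, label_bin=None):
    """Concatenate sequence tokens, a zero-shot bin token, and a label token.
    A missing experimental label becomes [MASK] and must be predicted."""
    zs_tok = AA_VOCAB + zs_bin
    label_tok = MASK_ID if label_bin is None else AA_VOCAB + ZS_BINS + label_bin
    return torch.tensor(seq_ids + [zs_tok, label_tok])

# Toy batch: two labeled context variants plus one query with a masked label.
batch = torch.stack([
    pack_variant([1, 2, 3, 4], zs_bin=7, label_bin=8),
    pack_variant([1, 2, 5, 4], zs_bin=3, label_bin=2),
    pack_variant([1, 6, 3, 4], zs_bin=5, label_bin=None),  # label to predict
])
model = UnifiedEncoder()
logits = model(batch)               # (3, seq_len, VOCAB)
query_label_logits = logits[2, -1]  # distribution over label tokens for the query
print(query_label_logits.shape)
```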
📝 Abstract
Accurately predicting protein fitness with minimal experimental data is a persistent challenge in protein engineering. We introduce PRIMO (PRotein In-context Mutation Oracle), a transformer-based framework that leverages in-context learning and test-time training to adapt rapidly to new proteins and assays without large task-specific datasets. By encoding sequence information, auxiliary zero-shot predictions, and sparse experimental labels from many assays as a unified token set within a masked-language-modeling pre-training paradigm, PRIMO learns to prioritize promising variants through a preference-based loss function. Across diverse protein families and properties, including both substitution and indel mutations, PRIMO outperforms zero-shot and fully supervised baselines. This work underscores the power of combining large-scale pre-training with efficient test-time adaptation to tackle challenging protein design tasks where data collection is expensive and label availability is limited.
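The abstract's "preference-based loss function" for prioritizing promising variants can be pictured as a pairwise ranking objective. The snippet below is a minimal, hedged sketch of one such loss (a Bradley-Terry style pairwise term); the exact objective used by PRIMO may differ, and `preference_loss` and the toy scores are purely illustrative.

```python
# Minimal sketch (assumed form, not the paper's exact objective): a pairwise
# preference loss that pushes the model to score an experimentally fitter
# variant above a less fit one, usable with only a handful of labeled pairs
# during test-time adaptation.
import torch
import torch.nn.functional as F

def preference_loss(scores_preferred, scores_other):
    """Bradley-Terry style pairwise loss: -log sigmoid(s_preferred - s_other)."""
    return -F.logsigmoid(scores_preferred - scores_other).mean()

# Toy example: model scores for 3 variant pairs where the first member of each
# pair was measured to be fitter.
s_pref = torch.tensor([1.2, 0.3, 2.0], requires_grad=True)
s_other = torch.tensor([0.5, 0.9, 1.1])
loss = preference_loss(s_pref, s_other)
loss.backward()   # gradients could drive a few test-time fine-tuning steps
print(float(loss))
```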