🤖 AI Summary
This study addresses the challenge of protein fitness optimization under low-throughput wet-lab constraints, where only hundreds of sequence-fitness pairs are available. We propose a plug-and-play guidance framework based on discrete diffusion models. Methodologically, we conduct the first systematic evaluation of classifier guidance versus posterior sampling for protein generation and introduce an adaptive guidance strategy, inspired by Thompson sampling, that replaces data-inefficient reinforcement learning (RL). Crucially, our approach requires no additional pretraining or policy fine-tuning. Under realistic experimental constraints, it significantly outperforms RL baselines, discovering higher-fitness sequences in fewer experimental rounds. These results demonstrate the framework's practicality and robustness as a plug-and-play method in low-data, low-throughput regimes, enabling efficient, sample-constrained protein engineering without task-specific architectural modifications.
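To make the guidance idea concrete, below is a minimal sketch of classifier guidance for a masked discrete diffusion model over protein sequences. All names here (`denoiser_logits`, `fitness_model`, `guided_unmask_step`, `guide_temp`) and the toy stand-ins are illustrative assumptions, not the paper's actual implementation: at each unmasking step, the denoiser's per-token distribution at one position is tilted toward candidates the fitness predictor scores highly, i.e. p_guided(x) ∝ p_model(x) · exp(f(x) / T).

```python
# Hypothetical sketch of classifier-guided sampling from a masked discrete
# diffusion model; the denoiser and fitness predictor below are toy stand-ins.
import numpy as np

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
MASK = "?"

def denoiser_logits(seq, pos):
    """Stand-in for a pretrained discrete diffusion denoiser: returns
    logits over the 20 amino acids for masked position `pos`."""
    rng = np.random.default_rng(abs(hash((tuple(seq), pos))) % (2**32))
    return rng.normal(size=len(AMINO_ACIDS))

def fitness_model(seq):
    """Stand-in for a fitness predictor trained on a few hundred labeled
    sequence-fitness pairs (e.g., a simple regressor)."""
    return sum(ord(a) for a in seq if a != MASK) / (50.0 * len(seq))

def guided_unmask_step(seq, pos, guide_temp=0.5):
    """Sample an amino acid at `pos` from the guided distribution
    p(x) ∝ p_model(x) * exp(f(x) / guide_temp), computed exactly by
    enumerating all 20 candidate tokens and scoring each with the
    fitness predictor (classifier guidance in discrete space)."""
    logits = denoiser_logits(seq, pos)
    scores = np.array([fitness_model(seq[:pos] + [aa] + seq[pos + 1:])
                       for aa in AMINO_ACIDS])
    guided = logits + scores / guide_temp
    probs = np.exp(guided - guided.max())
    probs /= probs.sum()
    return np.random.choice(AMINO_ACIDS, p=probs)

def guided_sample(length=12, guide_temp=0.5, seed=0):
    """Reverse diffusion sketch: start fully masked and unmask one
    position per step, in random order, with guided sampling."""
    rng = np.random.default_rng(seed)
    seq = [MASK] * length
    for pos in rng.permutation(length):
        seq[pos] = guided_unmask_step(seq, pos, guide_temp)
    return "".join(seq)

print(guided_sample())
```

Because the token alphabet is small (20 amino acids), the guided per-position distribution can be computed exactly by enumeration rather than by gradient-based approximations, which is part of what makes such guidance plug-and-play: the pretrained denoiser is never retrained or fine-tuned.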
📝 Abstract
Protein fitness optimization involves finding a protein sequence that maximizes desired quantitative properties in a combinatorially large design space of possible sequences. Recent developments in steering protein generative models (e.g., diffusion models, language models) offer a promising approach. However, by and large, past studies have optimized surrogate rewards and/or utilized large amounts of labeled data for steering, making it unclear how well existing methods perform and compare to each other in real-world optimization campaigns where fitness is measured by low-throughput wet-lab assays. In this study, we explore fitness optimization using small numbers (hundreds) of labeled sequence-fitness pairs and comprehensively evaluate strategies such as classifier guidance and posterior sampling for guiding generation from different discrete diffusion models of protein sequences. We also demonstrate how guidance can be integrated into adaptive sequence selection akin to Thompson sampling in Bayesian optimization, showing that plug-and-play guidance strategies offer advantages compared to alternatives such as reinforcement learning with protein language models.
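As a rough illustration of how guidance slots into adaptive selection, the following sketch mirrors a Thompson-sampling outer loop: each wet-lab round draws one fitness model from an approximate posterior (here a bootstrap ridge ensemble, an assumption for illustration, not necessarily the paper's surrogate), steers generation with that draw, and assays the top candidates. `generate_guided`, `featurize`, and `assay` are hypothetical placeholders for the guided sampler (e.g., the sketch above), a sequence featurizer, and the wet-lab measurement.

```python
# Hypothetical sketch of a Thompson-sampling-style adaptive loop around
# plug-and-play guidance; models and assays below are toy stand-ins.
import numpy as np

def fit_posterior_sample(X, y, rng):
    """One approximate posterior draw over fitness models: ridge
    regression on a bootstrap resample of the labeled data."""
    idx = rng.integers(0, len(X), size=len(X))
    Xb, yb = X[idx], y[idx]
    w = np.linalg.solve(Xb.T @ Xb + 1e-2 * np.eye(X.shape[1]), Xb.T @ yb)
    return lambda feats: feats @ w

def thompson_round(X, y, generate_guided, featurize, assay, batch=8, seed=0):
    """One adaptive round: sample a fitness model from the posterior,
    steer generation with it, and assay the top-scoring candidates."""
    rng = np.random.default_rng(seed)
    f = fit_posterior_sample(X, y, rng)        # Thompson draw
    candidates = generate_guided(f, 64)        # plug-and-play guided sampling
    feats = np.stack([featurize(s) for s in candidates])
    top = np.argsort(f(feats))[-batch:]        # act greedily under the draw
    new_y = np.array([assay(candidates[i]) for i in top])  # wet-lab labels
    return np.vstack([X, feats[top]]), np.concatenate([y, new_y])

if __name__ == "__main__":
    # Toy stand-ins so the loop runs end to end.
    rng = np.random.default_rng(1)
    featurize = lambda s: np.array([ord(c) / 100.0 for c in s])
    assay = lambda s: featurize(s).mean() + 0.01 * rng.normal()
    generate_guided = lambda f, n: [
        "".join(rng.choice(list("ACDEFGHIKLMNPQRSTVWY"), size=12))
        for _ in range(n)
    ]
    X = np.stack([featurize("ACDEFGHIKLMN")])
    y = np.array([assay("ACDEFGHIKLMN")])
    for r in range(3):
        X, y = thompson_round(X, y, generate_guided, featurize, assay, seed=r)
        print(f"round {r}: best measured fitness = {y.max():.3f}")
```

The design point the loop illustrates is data efficiency: each round re-fits only a small surrogate on the few hundred accumulated labels and reuses the frozen generative model, whereas an RL alternative would spend scarce assay budget on policy updates.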