ProSpero: Active Learning for Robust Protein Design Beyond Wild-Type Neighborhoods

📅 2025-05-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In protein engineering, jointly optimizing sequence fitness and novelty—while escaping the wild-type neighborhood and preserving biological plausibility—remains a major challenge for data-efficient design. This paper introduces an active learning framework that couples a frozen pretrained generative model (e.g., ProGen) with a dynamically updated surrogate model. Our method innovatively integrates fitness-driven residue importance scoring with biologically constrained sequential Monte Carlo sampling, substantially improving robustness under model misspecification. Across multiple benchmark tasks, our approach consistently matches or surpasses state-of-the-art methods: generated sequences achieve high experimental fitness (>95% functional among top-100 candidates) and high novelty (average sequence identity <30% relative to wild-type). The framework enables interpretable, scalable protein design in low-data regimes, establishing a new paradigm for biologically grounded generative optimization.

Technology Category

Application Category

📝 Abstract
Designing protein sequences of both high fitness and novelty is a challenging task in data-efficient protein engineering. Exploration beyond wild-type neighborhoods often leads to biologically implausible sequences or relies on surrogate models that lose fidelity in novel regions. Here, we propose ProSpero, an active learning framework in which a frozen pre-trained generative model is guided by a surrogate updated from oracle feedback. By integrating fitness-relevant residue selection with biologically-constrained Sequential Monte Carlo sampling, our approach enables exploration beyond wild-type neighborhoods while preserving biological plausibility. We show that our framework remains effective even when the surrogate is misspecified. ProSpero consistently outperforms or matches existing methods across diverse protein engineering tasks, retrieving sequences of both high fitness and novelty.
Problem

Research questions and friction points this paper is trying to address.

Designing high-fitness novel protein sequences efficiently
Exploring beyond wild-type without biological implausibility
Improving surrogate model fidelity in novel regions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Active learning with frozen generative model
Fitness-relevant residue selection integration
Biologically-constrained Sequential Monte Carlo sampling
🔎 Similar Papers
No similar papers found.