🤖 AI Summary
In data-efficient protein engineering, jointly optimizing sequence fitness and novelty remains difficult: exploring beyond the wild-type neighborhood tends to produce biologically implausible sequences. This paper introduces ProSpero, an active learning framework that couples a frozen pretrained generative model (e.g., ProGen) with a surrogate model updated from oracle feedback. The method combines fitness-driven residue importance scoring with biologically constrained sequential Monte Carlo sampling, and it remains effective when the surrogate is misspecified. Across multiple benchmark tasks, the approach matches or surpasses state-of-the-art methods: generated sequences achieve high experimental fitness (>95% functional among top-100 candidates) and high novelty (average sequence identity <30% relative to wild-type). The framework enables interpretable, scalable protein design in low-data regimes.
📝 Abstract
Designing protein sequences with both high fitness and high novelty is a challenging task in data-efficient protein engineering. Exploration beyond wild-type neighborhoods often yields biologically implausible sequences or relies on surrogate models that lose fidelity in novel regions. Here, we propose ProSpero, an active learning framework in which a frozen pre-trained generative model is guided by a surrogate updated from oracle feedback. By integrating fitness-relevant residue selection with biologically constrained Sequential Monte Carlo sampling, our approach enables exploration beyond wild-type neighborhoods while preserving biological plausibility. We show that our framework remains effective even when the surrogate is misspecified. ProSpero matches or outperforms existing methods across diverse protein engineering tasks, retrieving sequences with both high fitness and high novelty.
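The loop described in the abstract (a frozen proposal model, a surrogate refit from oracle feedback, and a propose-weight-resample step in the spirit of Sequential Monte Carlo) can be illustrated with a toy sketch. This is not the ProSpero implementation: every name below (`toy_oracle`, `ToySurrogate`, `frozen_proposal`, `smc_round`, `active_learning`) is hypothetical. The sketch assumes random point mutations as the frozen generator, a hidden target string as the oracle, and a per-position averaging surrogate, and it does not model fitness-relevant residue selection.

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def toy_oracle(seq, target):
    """Hypothetical assay: fraction of positions matching a hidden target."""
    return sum(a == b for a, b in zip(seq, target)) / len(target)


class ToySurrogate:
    """Assumed surrogate: mean observed fitness per (position, residue)."""

    def __init__(self, length):
        self.totals = [dict() for _ in range(length)]
        self.counts = [dict() for _ in range(length)]

    def update(self, seqs, fitnesses):
        for seq, fit in zip(seqs, fitnesses):
            for i, aa in enumerate(seq):
                self.totals[i][aa] = self.totals[i].get(aa, 0.0) + fit
                self.counts[i][aa] = self.counts[i].get(aa, 0) + 1

    def predict(self, seq):
        # Unseen (position, residue) pairs score 0.0, a crude pessimistic prior.
        total = 0.0
        for i, aa in enumerate(seq):
            n = self.counts[i].get(aa, 0)
            if n:
                total += self.totals[i][aa] / n
        return total / len(seq)


def frozen_proposal(seq, rng):
    """Stand-in for a frozen generative model: one random point mutation.

    It is never retrained; drawing from ALPHABET is the only constraint here.
    """
    chars = list(seq)
    chars[rng.randrange(len(chars))] = rng.choice(ALPHABET)
    return "".join(chars)


def smc_round(particles, surrogate, rng, steps=5, temperature=0.1):
    """Propose with the frozen model, weight by the surrogate, then resample."""
    for _ in range(steps):
        proposals = [frozen_proposal(p, rng) for p in particles]
        weights = [math.exp(surrogate.predict(p) / temperature) for p in proposals]
        particles = rng.choices(proposals, weights=weights, k=len(particles))
    return particles


def active_learning(wild_type, target, rounds=5, n_particles=32, seed=0):
    """Alternate surrogate-guided sampling with oracle queries; return all labels."""
    rng = random.Random(seed)
    surrogate = ToySurrogate(len(wild_type))
    labeled = {wild_type: toy_oracle(wild_type, target)}
    surrogate.update([wild_type], [labeled[wild_type]])
    for _ in range(rounds):
        best = max(labeled, key=labeled.get)  # restart from the best labeled sequence
        particles = smc_round([best] * n_particles, surrogate, rng)
        batch = sorted(p for p in set(particles) if p not in labeled)
        scores = [toy_oracle(p, target) for p in batch]  # oracle feedback
        surrogate.update(batch, scores)  # only the surrogate is updated
        labeled.update(zip(batch, scores))
    return labeled
```

Calling `active_learning("ACDEFGHIKL", "ACDEFGHIKW")` returns a dictionary that maps every queried sequence to its oracle score. The design point it illustrates is the division of labor: the proposal function stays fixed across rounds, and only the surrogate changes as labels arrive.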