Low-N Protein Activity Optimization with FolDE

📅 2025-10-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Protein directed evolution under low-data regimes suffers from high experimental costs, while existing active learning–driven approaches (e.g., ALDE) generalize poorly because greedy selection yields homogeneous training batches. To address these limitations, the authors propose FolDE, a framework that integrates protein language model (PLM) priors with scarce experimental data. FolDE introduces naturalness-based warm-starting and a constant-liar batch selector, jointly promoting prediction accuracy and mutational diversity. Evaluation across 20 protein targets demonstrates that FolDE outperforms state-of-the-art methods: it improves the discovery rate of the top 10% most active variants by 23% and increases the probability of identifying top 1% variants by 55%. These gains underscore FolDE's sample efficiency and robustness in low-data optimization scenarios.

📝 Abstract
Proteins are traditionally optimized through the costly construction and measurement of many mutants. Active Learning-assisted Directed Evolution (ALDE) alleviates that cost by predicting the best improvements and iteratively testing mutants to inform predictions. However, existing ALDE methods face a critical limitation: selecting the highest-predicted mutants in each round yields homogeneous training data insufficient for accurate prediction models in subsequent rounds. Here we present FolDE, an ALDE method designed to maximize end-of-campaign success. In simulations across 20 protein targets, FolDE discovers 23% more top 10% mutants than the best baseline ALDE method (p=0.005) and is 55% more likely to find top 1% mutants. FolDE achieves this primarily through naturalness-based warm-starting, which augments limited activity measurements with protein language model outputs to improve activity prediction. We also introduce a constant-liar batch selector, which improves batch diversity; this is important in multi-mutation campaigns but had limited effect in our benchmarks. The complete workflow is freely available as open-source software, making efficient protein optimization accessible to any laboratory.
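The naturalness-based warm-starting described above can be illustrated with a small sketch: scarce activity measurements are augmented with PLM "naturalness" scores, rescaled to the activity distribution and down-weighted as pseudo-labels. The rescaling and weighting scheme here is an assumption for illustration, not FolDE's actual procedure, and the function names are hypothetical.

```python
import statistics

def warm_start_labels(measured, naturalness, pseudo_weight=0.3):
    """Build an augmented, weighted training set.

    measured:    dict variant -> measured activity (scarce, weight 1.0)
    naturalness: dict variant -> PLM naturalness score (cheap, abundant)

    Unmeasured variants enter the training set with their naturalness
    score mapped onto the activity scale (z-score matching) at a
    reduced weight, so real measurements still dominate the fit.
    """
    unmeasured = [v for v in naturalness if v not in measured]
    X = list(measured)
    y = list(measured.values())
    w = [1.0] * len(measured)
    if not unmeasured:  # nothing to warm-start with
        return X, y, w

    acts = list(measured.values())
    nats = [naturalness[v] for v in unmeasured]
    mu_a, sd_a = statistics.mean(acts), statistics.pstdev(acts) or 1.0
    mu_n, sd_n = statistics.mean(nats), statistics.pstdev(nats) or 1.0

    for v in unmeasured:
        # Map the PLM score onto the measured-activity distribution.
        y_pseudo = mu_a + (naturalness[v] - mu_n) / sd_n * sd_a
        X.append(v)
        y.append(y_pseudo)
        w.append(pseudo_weight)
    return X, y, w
```

Any weighted regressor can then be fit on `(X, y, w)`; as more measurements arrive in later rounds, the pseudo-labels are progressively outnumbered by real data.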
Problem

Research questions and friction points this paper is trying to address.

Optimizing protein activity with limited experimental data
Overcoming training data homogeneity in active learning evolution
Improving mutant discovery efficiency using computational methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

FolDE uses naturalness-based warm-starting for prediction
It employs constant-liar batch selector for diversity
Combines protein language model outputs with measurements
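The constant-liar batch selector named above is a known heuristic from batch Bayesian optimization: after each greedy pick, a fake ("lie") label is assigned to the picked point and the surrogate is refit, which pushes subsequent picks away from near-duplicates. A minimal sketch follows; the surrogate, function names, and lie value are illustrative assumptions, not FolDE's actual implementation.

```python
def constant_liar_batch(candidates, model_fit, model_predict,
                        X, y, batch_size, lie="min"):
    """Select a diverse batch via the constant-liar heuristic."""
    X, y = list(X), list(y)  # copy so the caller's data is untouched
    pool = list(candidates)
    batch = []
    for _ in range(batch_size):
        model = model_fit(X, y)
        # Greedy pick: highest predicted activity among remaining candidates.
        best = max(pool, key=lambda x: model_predict(model, x))
        batch.append(best)
        pool.remove(best)
        # The "lie": pretend we measured a pessimistic value, so the next
        # refit steers away from near-duplicates of this pick.
        X.append(best)
        y.append(min(y) if lie == "min" else sum(y) / len(y))
    return batch

# Toy surrogate for demonstration: predict the label of the nearest
# training sequence under Hamming distance.
def fit(X, y):
    return list(zip(X, y))

def predict(model, x):
    return min(model, key=lambda xy: sum(a != b for a, b in zip(xy[0], x)))[1]
```

With a real surrogate (e.g., a Gaussian process or ridge regressor over sequence embeddings), the same loop yields batches that spread across mutation space instead of clustering around the single highest prediction.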
Jacob B. Roberts
Joint BioEnergy Institute, Emeryville, CA, USA; Department of Bioengineering, UCSF / UCB; Biological Systems and Engineering, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
Catherine R. Ji
Princeton University
Isaac Donnell
Joint BioEnergy Institute, Emeryville, CA, USA; Biological Systems and Engineering, Lawrence Berkeley National Laboratory, Berkeley, CA, USA; Chemistry Department, UCB
Thomas D. Young
Joint BioEnergy Institute, Emeryville, CA, USA; Biological Systems and Engineering, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
Allison N. Pearson
Joint BioEnergy Institute, Emeryville, CA, USA; Biological Systems and Engineering, Lawrence Berkeley National Laboratory, Berkeley, CA, USA; Department of Plant and Microbial Biology, UCB
Graham A. Hudson
Joint BioEnergy Institute, Emeryville, CA, USA; Biological Systems and Engineering, Lawrence Berkeley National Laboratory, Berkeley, CA, USA; QB3 Institute, University of California, Berkeley, CA, USA
Leah S. Keiser
Joint BioEnergy Institute, Emeryville, CA, USA; Biological Systems and Engineering, Lawrence Berkeley National Laboratory, Berkeley, CA, USA; Department of Chemical and Biomolecular Engineering, UCB
Mia Wesselkamper
Joint BioEnergy Institute, Emeryville, CA, USA; Bioengineering Department, UCB
Peter H. Winegar
Joint BioEnergy Institute, Emeryville, CA, USA; Biological Systems and Engineering, Lawrence Berkeley National Laboratory, Berkeley, CA, USA; QB3 Institute, University of California, Berkeley, CA, USA
Janik Ludwig
Joint BioEnergy Institute, Emeryville, CA, USA; Biological Systems and Engineering, Lawrence Berkeley National Laboratory, Berkeley, CA, USA; Faculty of Biology, Ludwig-Maximilians-Universität München
Sarah H. Klass
Joint BioEnergy Institute, Emeryville, CA, USA; Biological Systems and Engineering, Lawrence Berkeley National Laboratory, Berkeley, CA, USA; Department of Chemical and Biomolecular Engineering, UCB
Isha V. Sheth
Joint BioEnergy Institute, Emeryville, CA, USA; Chemical Biology Department, UCB
Ezechinyere C. Ukabiala
Joint BioEnergy Institute, Emeryville, CA, USA; Department of Bioengineering, UCSF / UCB; Biological Systems and Engineering, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
Maria C. T. Astolfi
Joint BioEnergy Institute, Emeryville, CA, USA; Department of Bioengineering, UCSF / UCB; Biological Systems and Engineering, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
Benjamin Eysenbach
Princeton University
Jay D. Keasling
Joint BioEnergy Institute, Emeryville, CA, USA; Department of Bioengineering, UCSF / UCB; Biological Systems and Engineering, Lawrence Berkeley National Laboratory, Berkeley, CA, USA; Chemistry Department, UCB; QB3 Institute, University of California, Berkeley, CA, USA; Department of Chemical and Biomolecular Engineering, UCB; The Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kemitorvet, Kongens Lyngby, Denmark