🤖 AI Summary
This work addresses the challenge of protein fitness optimization, where the combinatorial sequence space is vast and high-fitness variants are exceedingly sparse. The authors propose a novel approach that distills evolutionary knowledge from pretrained protein language models into a compact latent space and, for the first time, integrates conditional flow matching with classifier-free guidance to directly generate high-fitness protein sequences via ordinary differential equation (ODE) sampling, without requiring an additional fitness predictor. To mitigate data scarcity, they further introduce a synthetic data bootstrapping strategy. The method achieves state-of-the-art performance on benchmark tasks for AAV and GFP protein design and demonstrates the efficacy of synthetic data in low-data regimes.
📝 Abstract
Protein fitness optimization is challenged by a vast combinatorial landscape in which high-fitness variants are extremely sparse. Many current methods either underperform or require computationally expensive gradient-based sampling. We present CHASE, a framework that repurposes the evolutionary knowledge of pretrained protein language models by compressing their embeddings into a compact latent space. By training a conditional flow-matching model with classifier-free guidance, we enable the direct generation of high-fitness variants without predictor-based guidance during the ordinary differential equation (ODE) sampling steps. CHASE achieves state-of-the-art performance on AAV and GFP protein design benchmarks. Finally, we show that bootstrapping with synthetic data can further enhance performance in data-constrained settings.
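The core sampling idea can be illustrated with a small sketch: during ODE integration, the learned velocity field is evaluated both with and without the fitness condition, and the two are combined with a guidance weight. The toy velocity field, the guidance weight `w`, and all function names below are illustrative assumptions, not the paper's actual model:

```python
import numpy as np

def cfg_velocity(v_cond, v_uncond, w):
    """Classifier-free guidance: extrapolate from the unconditional
    velocity toward the conditional one with guidance weight w."""
    return v_uncond + w * (v_cond - v_uncond)

def euler_ode_sample(velocity_fn, z0, cond, w=2.0, steps=50):
    """Integrate dz/dt = v(z, t) from t=0 to t=1 with Euler steps,
    applying classifier-free guidance at every step."""
    z = z0.copy()
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v_c = velocity_fn(z, t, cond)   # condition (e.g. high fitness) provided
        v_u = velocity_fn(z, t, None)   # condition dropped (null token)
        z = z + dt * cfg_velocity(v_c, v_u, w)
    return z

# Toy stand-in for a trained latent velocity field: the conditional
# field pulls toward ones, the unconditional field toward zeros.
def toy_velocity(z, t, cond):
    target = np.ones_like(z) if cond is not None else np.zeros_like(z)
    return target - z

z_final = euler_ode_sample(toy_velocity, np.zeros(4), cond="high_fitness", w=2.0)
```

With `w > 1` the guided trajectory is pushed past the conditional target, which is the usual way classifier-free guidance trades diversity for condition adherence; in the paper's setting the decoded latent would then be mapped back to a protein sequence.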