🤖 AI Summary
This work addresses the challenge of protein fitness optimization, where the combinatorial sequence space is vast and high-fitness variants are exceedingly sparse. The authors propose a novel approach that distills evolutionary knowledge from pretrained protein language models into a compact latent space and, for the first time, integrates conditional flow matching with classifier-free guidance to directly generate high-fitness protein sequences via ordinary differential equation (ODE) sampling, without requiring an additional fitness predictor. To mitigate data scarcity, they further introduce a synthetic data bootstrapping strategy. The method achieves state-of-the-art performance on benchmark tasks for AAV and GFP protein design and demonstrates the efficacy of synthetic data in low-data regimes.
📝 Abstract
Protein fitness optimization is challenged by a vast combinatorial landscape in which high-fitness variants are extremely sparse. Many current methods either underperform or require computationally expensive gradient-based sampling. We present CHASE, a framework that repurposes the evolutionary knowledge of pretrained protein language models by compressing their embeddings into a compact latent space. By training a conditional flow-matching model with classifier-free guidance, we enable the direct generation of high-fitness variants without predictor-based guidance during the ordinary differential equation (ODE) sampling steps. CHASE achieves state-of-the-art performance on AAV and GFP protein design benchmarks. Finally, we show that bootstrapping with synthetic data can further enhance performance in data-constrained settings.
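The core sampling idea can be illustrated with a small sketch: during ODE integration, the learned velocity field is evaluated both with and without the fitness condition, and the two are combined with a guidance weight. The toy velocity field, the guidance weight `w`, and all function names below are illustrative assumptions, not the paper's actual model:

```python
import numpy as np

def cfg_velocity(v_cond, v_uncond, w):
    """Classifier-free guidance: extrapolate from the unconditional
    velocity toward the conditional one with guidance weight w."""
    return v_uncond + w * (v_cond - v_uncond)

def euler_ode_sample(velocity_fn, z0, cond, w=2.0, steps=50):
    """Integrate dz/dt = v(z, t) from t=0 to t=1 with Euler steps,
    applying classifier-free guidance at every step."""
    z = z0.copy()
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v_c = velocity_fn(z, t, cond)   # condition (e.g. high fitness) provided
        v_u = velocity_fn(z, t, None)   # condition dropped (null token)
        z = z + dt * cfg_velocity(v_c, v_u, w)
    return z

# Toy stand-in for a trained latent velocity field: the conditional
# field pulls toward ones, the unconditional field toward zeros.
def toy_velocity(z, t, cond):
    target = np.ones_like(z) if cond is not None else np.zeros_like(z)
    return target - z

z_final = euler_ode_sample(toy_velocity, np.zeros(4), cond="high_fitness", w=2.0)
```

With `w > 1` the guided trajectory is pushed past the conditional target, which is the usual way classifier-free guidance trades diversity for condition adherence; in the paper's setting the decoded latent would then be mapped back to a protein sequence.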