LineageFlow: Flow Matching for High-Fidelity Family-Aware Protein Sequence Generation

📅 2026-05-21
🤖 AI Summary
Existing protein sequence generation methods often neglect evolutionary constraints, which yields sequences with weak family specificity and low biophysical plausibility. This work proposes LineageFlow, a Dirichlet flow-matching model that starts generation from lineage priors derived from ancestral sequence reconstruction, so that sampling becomes structured mutation from an evolved scaffold. A rerouting mechanism, a single mutate-select-amplify intervention at an intermediate time, enables objective-guided sampling without per-step predictor guidance. Across multiple protein families, the generated sequences show family validity close to held-out natural sequences and higher predicted structural confidence than uniform- or mask-initialized baselines while preserving novelty and diversity, and the method is demonstrated on a zero-shot enzyme generation case study.
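To make the contrast between a lineage prior and uniform noise concrete, here is a minimal sketch of how an initial state could be drawn per position from a Dirichlet distribution. It is not the authors' implementation: the function names, the `concentration` knob, the `peaked_prior` stand-in for an ancestral-reconstruction posterior, and the toy scaffold are all assumptions made for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def sample_dirichlet(alpha, rng):
    """Draw one simplex point from Dirichlet(alpha) via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def initial_state(prior, concentration, rng):
    """Sample one simplex point per sequence position.

    prior: per-position probability vectors (hypothetical ASR posteriors).
    concentration: hypothetical knob; larger values stay closer to the prior.
    """
    eps = 1e-3  # keeps every Dirichlet parameter strictly positive
    return [
        sample_dirichlet([concentration * p + eps for p in column], rng)
        for column in prior
    ]

def uniform_prior(length):
    """Position-independent prior, as used by uniform-noise initialization."""
    k = len(AMINO_ACIDS)
    return [[1.0 / k] * k for _ in range(length)]

def peaked_prior(sequence, mass=0.9):
    """Toy stand-in for an ASR posterior: put `mass` on the reconstructed residue."""
    k = len(AMINO_ACIDS)
    rest = (1.0 - mass) / (k - 1)
    return [[mass if aa == ref else rest for aa in AMINO_ACIDS] for ref in sequence]

def argmax_decode(state):
    """Read off the most probable residue at each position."""
    return "".join(
        AMINO_ACIDS[max(range(len(col)), key=col.__getitem__)] for col in state
    )

rng = random.Random(0)
ancestor = "MKTAYIAKQR"  # made-up toy scaffold, not a real reconstructed ancestor
x0_lineage = initial_state(peaked_prior(ancestor), concentration=50.0, rng=rng)
x0_uniform = initial_state(uniform_prior(len(ancestor)), concentration=20.0, rng=rng)

print("lineage-initialized:", argmax_decode(x0_lineage))
print("uniform-initialized:", argmax_decode(x0_uniform))
```

In this toy setup the lineage-initialized state already encodes the scaffold's conserved residues, whereas the uniform-initialized state carries no position-specific information, which is the gap the summary describes.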
📝 Abstract
Protein sequence generation for engineering requires samples that are biophysically plausible and, when targeting a family/domain, remain recognizable members while exploring within-family diversity. Current discrete generative models typically start from uniform or masked-token noise, which discards strong position-specific constraints induced by evolution and forces the model to reconstruct conserved residues from scratch, leading to weak family control and low plausibility. We propose LineageFlow, a Dirichlet flow-matching model that initializes generation from lineage priors derived from ancestral sequence reconstruction, turning generation into structured mutation from an evolved scaffold. Across diverse protein families, LineageFlow achieves family validity close to held-out natural sequences and improves predicted structural confidence over uniform-/mask-initialized baselines while maintaining substantial novelty and diversity. Finally, we introduce rerouting, a single intermediate-time mutate-select-amplify intervention that enables objective-guided sampling without per-step predictor guidance and yields further gains in plausibility, including a zero-shot enzyme generation case study. Code is available at https://github.com/Jinx-byebye/LineageFlow.
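The mutate-select-amplify pattern named in the abstract can be sketched on a small population of intermediate samples. The abstract does not specify the state representation, the objective, or any hyperparameters, so everything below is an assumption: discrete strings stand in for intermediate states, `toy_score` is a hypothetical objective, and `n_mutants`, `keep`, and `rate` are illustrative parameters.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mutate(seq, rate, rng):
    """Resample each position with probability `rate`."""
    return "".join(
        rng.choice(AMINO_ACIDS) if rng.random() < rate else aa for aa in seq
    )

def reroute(population, score, rng, n_mutants=4, keep=2, rate=0.1):
    """One mutate-select-amplify step over a population of intermediate samples."""
    # Mutate: propose variants of every member; originals stay in the pool.
    candidates = list(population)
    for seq in population:
        candidates.extend(mutate(seq, rate, rng) for _ in range(n_mutants))
    # Select: keep the best-scoring candidates under the objective.
    survivors = sorted(candidates, key=score, reverse=True)[:keep]
    # Amplify: copy survivors until the population regains its original size.
    return [survivors[i % keep] for i in range(len(population))]

def toy_score(seq):
    """Hypothetical objective: fraction of hydrophobic residues."""
    return sum(aa in "AILMFVW" for aa in seq) / len(seq)

rng = random.Random(0)
population = ["MKTAYIAKQR", "MKSAYLAKQR", "MRTGYIAEQR", "MKTAYIPKQK"]
rerouted = reroute(population, toy_score, rng)

print("best score before:", max(map(toy_score, population)))
print("best score after: ", max(map(toy_score, rerouted)))
```

Because the objective is consulted only once, at the intervention, ordinary sampling would resume from the rerouted population with no predictor call at the remaining steps. Keeping the originals in the candidate pool means the best score in this sketch cannot decrease.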
Problem

Research questions and friction points this paper is trying to address.

protein sequence generation
family-aware
biophysical plausibility
evolutionary constraints
sequence diversity
Innovation

Methods, ideas, or system contributions that make the work stand out.

flow matching
ancestral sequence reconstruction
protein sequence generation
family-aware generation
rerouting