Protein Counterfactuals via Diffusion-Guided Latent Optimization

📅 2026-03-11
🤖 AI Summary
Existing deep learning models predict protein properties with high accuracy, but they often fail to provide actionable guidance for protein engineering. To address this limitation, this work proposes the MCCOP framework, which, for the first time, incorporates a pretrained diffusion model as a manifold prior for counterfactual protein generation. Operating in a continuous joint latent space of sequence and structure, MCCOP uses multi-objective optimization to balance three goals: reaching the desired functional property, keeping mutational perturbations minimal, and preserving structural foldability. The result is a set of sparse, interpretable variants grounded in biophysical principles. On GFP fluorescence recovery, thermal stability enhancement, and E3 ligase activity restoration, MCCOP generates mutations that are sparser and more plausible than those of baseline methods and that align closely with established mechanistic insights (such as chromophore packing and hydrophobic core stabilization), thereby unifying model interpretability with hypothesis-driven design.

📝 Abstract
Deep learning models can predict protein properties with unprecedented accuracy but rarely offer mechanistic insight or actionable guidance for engineering improved variants. When a model flags an antibody as unstable, the protein engineer is left without recourse: which mutations would rescue stability while preserving function? We introduce Manifold-Constrained Counterfactual Optimization for Proteins (MCCOP), a framework that computes minimal, biologically plausible sequence edits that flip a model's prediction to a desired target state. MCCOP operates in a continuous joint sequence-structure latent space and employs a pretrained diffusion model as a manifold prior, balancing three objectives: validity (achieving the target property), proximity (minimizing mutations), and plausibility (producing foldable proteins). We evaluate MCCOP on three protein engineering tasks (GFP fluorescence rescue, thermodynamic stability enhancement, and E3 ligase activity recovery) and show that it generates sparser, more plausible counterfactuals than both discrete and continuous baselines. The recovered mutations align with known biophysical mechanisms, including chromophore packing and hydrophobic core consolidation, establishing MCCOP as a tool for both model interpretation and hypothesis-driven protein design. Our code is publicly available at github.com/weroks/mccop.
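To make the three-way trade-off in the abstract concrete, the following is a minimal toy sketch of counterfactual search by gradient descent on a weighted sum of validity, proximity, and plausibility terms. It is not the MCCOP implementation: the logistic predictor, the squared-distance proximity term, the Gaussian energy standing in for the pretrained diffusion prior, and all names and weights (`counterfactual_search`, `lam_prox`, `lam_prior`) are assumptions made purely for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def counterfactual_search(z0, w, b, target=1.0, lam_prox=0.1,
                          lam_prior=0.05, lr=0.2, steps=1000):
    """Toy latent-space counterfactual search (illustrative only).

    z0        : starting latent vector (stands in for an encoded protein)
    w, b      : weights of a hypothetical logistic property predictor
    target    : desired predictor output
    lam_prox  : weight of the proximity term ||z - z0||^2
    lam_prior : weight of the plausibility term 0.5 * ||z||^2, a Gaussian
                stand-in for a diffusion-model manifold prior (assumption)
    """
    z = z0.astype(float)  # astype returns a copy, so z0 is not modified
    for _ in range(steps):
        p = sigmoid(w @ z + b)
        # Validity: gradient of (p - target)^2 with respect to z.
        grad_valid = 2.0 * (p - target) * p * (1.0 - p) * w
        # Proximity: gradient of ||z - z0||^2.
        grad_prox = 2.0 * (z - z0)
        # Plausibility: gradient of 0.5 * ||z||^2.
        grad_prior = z
        z -= lr * (grad_valid + lam_prox * grad_prox + lam_prior * grad_prior)
    return z

# Hypothetical example: a latent point the toy predictor scores as "low".
w = np.array([2.0, -1.0, 0.5, 0.0])
b = 0.0
z0 = np.array([-1.0, 1.0, 0.0, 0.3])
z_cf = counterfactual_search(z0, w, b)
print("prediction before:", sigmoid(w @ z0 + b))
print("prediction after: ", sigmoid(w @ z_cf + b))
print("latent shift norm:", np.linalg.norm(z_cf - z0))
```

In this sketch, raising `lam_prox` keeps the counterfactual closer to the starting point at the cost of a weaker change in the prediction. A real pipeline would additionally need to decode the optimized latent back into a sequence and structure, which the toy example omits.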
Problem

Research questions and friction points this paper is trying to address.

protein engineering
counterfactual explanation
stability rescue
functional preservation
minimal mutation
Innovation

Methods, ideas, or system contributions that make the work stand out.

counterfactual optimization
diffusion model
protein engineering
latent space
manifold prior
Weronika Kłos
Machine Learning Group, Technische Universität Berlin, Berlin, Germany; Berlin Institute for the Foundations of Learning and Data (BIFOLD)
Sidney Bender
Technical University of Berlin
Deep Learning, Explainable AI, Trustworthy ML, Generative Modelling
Lukas Kades
BASF Digital Solutions GmbH, Ludwigshafen am Rhein, Germany