Expanding functional protein sequence space using high entropy generative models

📅 2026-05-05

🤖 AI Summary
This study investigates how parameter density in protein sequence generative models relates to experimental performance, with the aim of expanding the space of functional sequences. Using evolutionary sequence data from the Chorismate Mutase family, the authors compare fully connected Boltzmann Machines (bmDCA) with sparse models built by progressive edge activation (eaDCA) and edge decimation (edDCA), and identify a maximum-entropy model, meDCA, along the decimation trajectory. This model maximizes the flexibility of the sequence distribution while still satisfying coevolutionary statistical constraints. Compared with its low-entropy counterparts, meDCA samples a viable sequence space more than fifteen orders of magnitude larger without compromising functionality, reduces overfitting, and better captures the local neutral spaces around natural proteins. In vivo complementation assays show that sequences from all architectures are functional at high success rates.
📝 Abstract
Boltzmann Machines trained on evolutionary sequence data have emerged as a powerful paradigm for the data-driven design of artificial proteins. However, the relationship between model architecture, specifically parameter density, and experimental performance remains poorly understood. Here, we investigate this relationship using the Chorismate Mutase enzyme family as a model system. We compare standard fully connected Boltzmann Machines for Direct Coupling Analysis (bmDCA) with sparse models generated via progressive edge activation (eaDCA) and edge decimation (edDCA). We identify a maximum-entropy model (meDCA) along the decimation trajectory that represents an optimal balance between constraint satisfaction and the flexibility of the probability distribution. We synthesized and tested artificial sequences from all models using an in vivo complementation assay, finding that all architectures, regardless of sparsity, generate functional enzymes with high success rates, even at significant divergence from natural sequences. Despite this functional equivalence, we demonstrate that the meDCA model samples a viable sequence space that is more than fifteen orders of magnitude larger than its low-entropy counterparts. Furthermore, comparative analyses reveal that high-entropy models systematically minimize overfitting and better capture the local neutral spaces surrounding natural proteins. These findings suggest that while various models satisfying coevolutionary statistics can generate functional sequences, high-entropy Boltzmann Machines provide a superior representation of the underlying evolutionary fitness landscape.
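To make the link between model entropy and the size of the samplable sequence space concrete, the following is a minimal toy sketch, not the authors' method. It assumes the usual pairwise Potts form of a DCA Boltzmann Machine, P(s) ∝ exp(-E(s)), and assumes that "size of sequence space" can be read as exp(S), with S the Shannon entropy in nats. All function names are hypothetical, and the brute-force enumeration only works for tiny systems; how the paper actually estimates entropy is not stated in this text.

```python
import itertools
import math


def potts_energy(seq, fields, couplings):
    """E(s) = -sum_i h_i(s_i) - sum_{(i,j) in edges} J_ij(s_i, s_j).

    `couplings` is a dict keyed by edge (i, j); a sparse model simply
    has fewer keys than a fully connected one.
    """
    energy = -sum(fields[i][a] for i, a in enumerate(seq))
    for (i, j), J in couplings.items():
        energy -= J[seq[i]][seq[j]]
    return energy


def exact_entropy(n_sites, n_states, fields, couplings):
    """Shannon entropy (nats) of P(s) ∝ exp(-E(s)) by full enumeration."""
    energies = [
        potts_energy(s, fields, couplings)
        for s in itertools.product(range(n_states), repeat=n_sites)
    ]
    e_min = min(energies)  # shift energies for numerical stability
    weights = [math.exp(-(e - e_min)) for e in energies]
    z = sum(weights)
    return -sum((w / z) * math.log(w / z) for w in weights if w > 0)


def log10_size_ratio(entropy_a, entropy_b):
    """log10 of exp(S_a) / exp(S_b): orders of magnitude between two models."""
    return (entropy_a - entropy_b) / math.log(10)


# Toy system: 3 sites, 2 states, zero fields (illustrative values only).
zero_fields = [[0.0, 0.0] for _ in range(3)]
s_free = exact_entropy(3, 2, zero_fields, {})
# One strong coupling that favors equal states on sites 0 and 1.
strong = {(0, 1): [[3.0, 0.0], [0.0, 3.0]]}
s_coupled = exact_entropy(3, 2, zero_fields, strong)
```

Under this reading, a gap of fifteen orders of magnitude corresponds to an entropy difference of about 15 ln 10 ≈ 34.5 nats between two models.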
Problem

Research questions and friction points this paper is trying to address.

protein design
Boltzmann Machines
sequence space
model sparsity
evolutionary fitness landscape
Innovation

Methods, ideas, or system contributions that make the work stand out.

high-entropy generative models
Boltzmann Machines
protein design
sequence space expansion
Direct Coupling Analysis
Roberto Netti
Sorbonne Université, CNRS, Department of Computational, Quantitative and Synthetic Biology—CQSB, 75005 Paris, France
Emily Hinds
Center for Physics of Evolving Systems, University of Chicago, Chicago, IL, USA; Pritzker School of Molecular Engineering, University of Chicago, Chicago, IL, USA
Francesco Calvanese
Institut de Physique Théorique, Université Paris-Saclay, CEA, Gif-sur-Yvette, France
Rama Ranganathan
Center for Physics of Evolving Systems, University of Chicago, Chicago, IL, USA; Department of Biochemistry and Molecular Biology, University of Chicago, Chicago, IL, USA
Martin Weigt
Sorbonne Université Paris (former Université Pierre & Marie Curie)
computational biology, statistical physics, biological physics, statistical inference, quantitative biology
Francesco Zamponi
Sapienza Università di Roma
Physics, Statistical mechanics