ENSEMBITS: an alphabet of protein conformational ensembles

📅 2026-05-13

🤖 AI Summary
Existing protein structure tokenizers model only static local geometry and struggle to capture the concerted motions and polymorphism inherent in conformational ensembles. This work proposes Ensembits, the first tokenizer designed specifically for dynamic protein conformational ensembles, which uses a residual vector-quantized variational autoencoder (VQ-VAE) and a frame distillation objective to learn discrete dynamic representations from large-scale molecular dynamics data. Its key innovations include a discretization framework tailored for flexible proteins, frame distillation to mitigate data sparsity, and the ability to infer dynamic tokens from a single static structure. Experiments show that Ensembits outperforms existing methods in RMSF prediction and residue-level motion ANOVA tests, and matches or exceeds static tokenizers in tasks such as EC/GO annotation, binding site identification, affinity prediction, and zero-shot mutation effect estimation, despite requiring substantially less pretraining data.
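For readers unfamiliar with the RMSF metric named in the summary, the following is a minimal sketch of how per-residue root-mean-square fluctuation is conventionally computed from an ensemble. It is not taken from the paper: the function name `rmsf`, the array layout `(n_frames, n_residues, 3)`, and the synthetic ensemble are all assumptions made for illustration, and the frames are assumed to be already superimposed on a common reference.

```python
import numpy as np

def rmsf(ensemble: np.ndarray) -> np.ndarray:
    """Per-residue RMSF for coordinates of shape (n_frames, n_residues, 3).

    Assumes the frames are already aligned to a common reference.
    """
    mean_pos = ensemble.mean(axis=0)                     # (n_residues, 3)
    sq_dev = ((ensemble - mean_pos) ** 2).sum(axis=-1)   # (n_frames, n_residues)
    return np.sqrt(sq_dev.mean(axis=0))                  # (n_residues,)

# Synthetic ensemble (hypothetical data): the last residue is far more mobile.
rng = np.random.default_rng(0)
n_frames, n_residues = 200, 5
base = rng.normal(size=(n_residues, 3))
noise_scale = np.array([0.1, 0.1, 0.1, 0.1, 1.0])
noise = rng.normal(size=(n_frames, n_residues, 3)) * noise_scale[None, :, None]
ensemble = base + noise

values = rmsf(ensemble)
most_flexible = int(values.argmax())
```

A tokenizer evaluated on RMSF prediction would be asked to recover values like these from its tokens; how Ensembits does so is not specified in this summary.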
📝 Abstract
Protein structure tokenizers (PSTs) are workhorses in protein language modeling, function prediction, and evolutionary analysis. However, existing PSTs capture only the local geometry of static structures and miss the correlated motions and alternative conformational states revealed by protein ensembles. Here we introduce Ensembits, the first tokenizer of protein conformational ensembles. Ensembits addresses challenges inherent to tokenizing dynamics: deriving informative geometric descriptors across conformations, permutation-invariant encoding of variable-size ensembles, and overcoming sparsity in dynamics data. Trained with a Residual VQ-VAE using a frame distillation objective on a large molecular dynamics corpus, Ensembits outperforms all related methods on RMSF prediction and is the strongest standalone structural tokenizer on a token-conditioned ANOVA test of per-residue motion amplitude. Ensembits further matches or exceeds static tokenizers on EC, GO, binding site/affinity prediction, and zero-shot mutation-effect prediction despite using far less pretraining data. Notably, the distillation objective enables Ensembits to predict dynamics tokens from a single predicted structure, which alleviates dynamics data sparsity. As the field moves from static structure prediction toward ensemble generation, Ensembits offers the discrete vocabulary needed to bring dynamics into protein language modeling and design.
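The quantization step of a residual VQ-VAE, as named in the abstract, can be illustrated with a small sketch. This shows only generic residual vector quantization (each level quantizes what the previous levels left unexplained); it is not the authors' implementation, and it omits the encoder, decoder, training, and frame distillation objective. The function name `residual_quantize`, the toy codebooks, and the input vector are hypothetical.

```python
import numpy as np

def residual_quantize(x: np.ndarray, codebooks: list) -> tuple:
    """Quantize rows of x (n, d) with a stack of codebooks, each of shape (k, d).

    Returns integer codes of shape (n, n_levels) and the summed reconstruction (n, d).
    """
    residual = x.astype(float).copy()
    recon = np.zeros_like(residual)
    codes = []
    for codebook in codebooks:
        # Squared Euclidean distance from each residual to each code vector.
        dists = ((residual[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)
        chosen = codebook[idx]
        recon += chosen        # accumulate the approximation
        residual -= chosen     # pass the remainder to the next level
        codes.append(idx)
    return np.stack(codes, axis=1), recon

# Toy example: a coarse codebook followed by a fine one (hypothetical values).
coarse = np.array([[0.0, 0.0], [10.0, 10.0]])
fine = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
x = np.array([[10.9, 10.1]])

codes, recon = residual_quantize(x, [coarse, fine])
```

Here the first level picks the coarse code nearest to the input and the second level refines the remainder, so each residue-level embedding is represented by a short tuple of discrete indices rather than a single token.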
Problem

Research questions and friction points this paper is trying to address.

protein conformational ensembles
protein dynamics
structure tokenization
molecular dynamics
conformational states
Innovation

Methods, ideas, or system contributions that make the work stand out.

conformational ensembles
protein dynamics
structure tokenizer
frame distillation
Residual VQ-VAE