Adaptive Protein Tokenization

📅 2026-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current protein structure tokenization methods create tokens by aggregating local neighborhood information, an approach that struggles to capture global structural relationships and therefore hits performance bottlenecks on generative and representation tasks. This work proposes a global, hierarchical tokenization framework in which successive tokens contribute progressively finer structural detail to an adaptive global representation, avoiding sequence-compression operations and mitigating error accumulation. The adaptive tokens also enable inference criteria based on a sequence's information content, and the method supports zero-shot strategies for protein shrinking and affinity maturation. Evaluated on reconstruction, generation, and CATH classification tasks, the approach matches or surpasses existing models built on local tokenizers, and nonlinear probes on its tokenized sequences outperform equivalent probes on representations from other tokenizers.
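The summary's core idea, successive tokens adding ever finer detail to one global representation, resembles residual (coarse-to-fine) quantization. Below is a minimal NumPy sketch of that general technique; the codebooks, dimensions, and function names are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical codebooks, one per refinement level (sizes are made up).
DIM, CODES, LEVELS = 8, 16, 4
codebooks = rng.normal(size=(LEVELS, CODES, DIM))
codebooks[:, 0] = 0.0  # a "no-op" codeword so refinement never hurts

def tokenize(x):
    """Greedy coarse-to-fine tokenization: each successive token
    encodes the residual left unexplained by the previous levels."""
    tokens, residual = [], x.copy()
    for level in range(LEVELS):
        idx = int(np.argmin(np.linalg.norm(codebooks[level] - residual, axis=1)))
        tokens.append(idx)
        residual = residual - codebooks[level][idx]
    return tokens

def detokenize(tokens, levels=None):
    """Reconstruct from a prefix of the tokens; using more tokens
    recovers more detail (the 'adaptive information content' idea)."""
    levels = len(tokens) if levels is None else levels
    return sum(codebooks[l][t] for l, t in enumerate(tokens[:levels]))

x = rng.normal(size=DIM)
toks = tokenize(x)
# Reconstruction error is non-increasing as tokens are added.
errs = [np.linalg.norm(x - detokenize(toks, k)) for k in range(1, LEVELS + 1)]
```

Because each level may pick the zero codeword, truncating the token sequence, as in the paper's task-specific adaptation of information content, only ever coarsens the reconstruction.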

📝 Abstract
Tokenization is a promising path to multi-modal models capable of jointly understanding protein sequences, structure, and function. Existing protein structure tokenizers create tokens by pooling information from local neighborhoods, an approach that limits their performance on generative and representation tasks. In this work, we present a method for global tokenization of protein structures in which successive tokens contribute increasing levels of detail to a global representation. This change resolves several issues with generative models based on local protein tokenization: it mitigates error accumulation, provides embeddings without sequence-reduction operations, and allows task-specific adaptation of a tokenized sequence's information content. We validate our method on reconstruction, generative, and representation tasks and demonstrate that it matches or outperforms existing models based on local protein structure tokenizers. We show how adaptive tokens enable inference criteria based on information content, which boosts designability. We validate representations generated from our tokenizer on CATH classification tasks and demonstrate that non-linear probing on our tokenized sequences outperforms equivalent probing on representations from other tokenizers. Finally, we demonstrate how our method supports zero-shot protein shrinking and affinity maturation.
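The abstract's non-linear probing of tokenized sequences can be illustrated with a one-hidden-layer MLP probe trained on stand-in embeddings. Everything below (synthetic data, layer sizes, the plain gradient-descent loop) is an assumption for illustration, not the paper's CATH setup:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in "token embeddings" with CATH-like class labels.
N, D, CLASSES = 300, 16, 3
X = rng.normal(size=(N, D))
y = rng.integers(0, CLASSES, size=N)
X += 2.0 * np.eye(CLASSES, D)[y]  # shift each class's mean to make it learnable

# One-hidden-layer MLP probe (the "non-linear probe" pattern).
H = 32
W1 = rng.normal(scale=0.1, size=(D, H)); b1 = np.zeros(H)
W2 = rng.normal(scale=0.1, size=(H, CLASSES)); b2 = np.zeros(CLASSES)

def forward(X):
    h = np.maximum(X @ W1 + b1, 0.0)  # ReLU hidden layer
    logits = h @ W2 + b2
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    return h, p / p.sum(axis=1, keepdims=True)

lr = 0.1
Y = np.eye(CLASSES)[y]
for _ in range(200):
    h, p = forward(X)
    g = (p - Y) / N                  # d(cross-entropy)/d(logits) for softmax
    gh = (g @ W2.T) * (h > 0.0)      # backprop through the ReLU
    W2 -= lr * (h.T @ g); b2 -= lr * g.sum(axis=0)
    W1 -= lr * (X.T @ gh); b1 -= lr * gh.sum(axis=0)

acc = float((forward(X)[1].argmax(axis=1) == y).mean())
```

The probe itself stays small; in a probing evaluation the frozen tokenizer supplies the embeddings and only these few weights are trained, so accuracy reflects how much class information the tokens carry.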
Problem

Research questions and friction points this paper is trying to address.

protein tokenization
local tokenization
generative modeling
representation learning
error accumulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

adaptive tokenization
global representation
protein structure modeling
zero-shot protein design
information-content adaptation
Rohit Dilip
California Institute of Technology
Ayush Varshney
Carnegie Mellon University
David Van Valen
California Institute of Technology
Biological Physics · Systems Biology