🤖 AI Summary
Existing protein structure tokenization methods lack a unified evaluation framework, particularly at the fine-grained local-substructure level. To address this, we propose StructTokenBench, the first benchmark dedicated to evaluating local structural tokenization, and introduce AminoAseed, an efficient tokenization strategy that improves representation quality and codebook utilization via optimized codebook gradient updates and a balanced trade-off between codebook size and dimension. Our contributions are threefold: (1) establishing the first fine-grained evaluation paradigm for structural tokenization; (2) introducing AminoAseed for improved codebook learning; and (3) constructing a unified framework integrating structural tokenization, codebook optimization, and 3D geometric representation. Evaluated on 24 supervised tasks, our approach achieves an average performance gain of 6.31%, a 12.83% improvement in sensitivity, and a 124.03% increase in codebook utilization, substantially outperforming ESM3.
📝 Abstract
Recent years have witnessed a surge in the development of protein structure tokenization methods, which chunk protein 3D structures into discrete or continuous representations. Structure tokenization enables the direct application of powerful techniques such as language modeling to protein structures, and allows large multimodal models to integrate structures with protein sequences and functional text. Despite this progress, the capabilities and limitations of these methods remain poorly understood due to the lack of a unified evaluation framework. We first introduce StructTokenBench, a framework that comprehensively evaluates the quality and efficiency of structure tokenizers, focusing on fine-grained local substructures rather than the global structures typical of existing benchmarks. Our evaluations reveal that no single model dominates across all benchmarking perspectives. Observations of codebook under-utilization led us to develop AminoAseed, a simple yet effective strategy that enhances codebook gradient updates and optimally balances codebook size and dimension for improved tokenizer utilization and quality. Compared to the leading model ESM3, our method achieves an average performance improvement of 6.31% across 24 supervised tasks, with sensitivity and utilization rates increased by 12.83% and 124.03%, respectively.
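Discrete structure tokenizers of this kind typically map each residue's local structural embedding to the nearest entry of a learned codebook; codebook utilization is then the fraction of entries ever selected, and low utilization is the failure mode that motivates AminoAseed. The sketch below illustrates only these two generic ideas in NumPy; the function names, shapes, and random inputs are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def quantize(residue_embeddings, codebook):
    """Nearest-neighbor tokenization: map each per-residue embedding
    (N, D) to the closest of K codebook vectors (K, D).
    Returns (token_ids, quantized_embeddings)."""
    # Squared Euclidean distance from every embedding to every code vector,
    # computed via broadcasting: (N, 1, D) - (1, K, D) -> (N, K, D).
    dists = ((residue_embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    token_ids = dists.argmin(axis=1)          # discrete structure tokens
    return token_ids, codebook[token_ids]     # tokens and their code vectors

def codebook_utilization(token_ids, codebook_size):
    """Fraction of codebook entries actually used (the under-utilization
    symptom: this stays far below 1 when gradients rarely reach most codes)."""
    return np.unique(token_ids).size / codebook_size

# Toy usage with random data (illustrative only):
rng = np.random.default_rng(0)
emb = rng.normal(size=(128, 16))   # 128 residues, 16-dim local features
cb = rng.normal(size=(64, 16))     # codebook of 64 entries
ids, quantized = quantize(emb, cb)
util = codebook_utilization(ids, codebook_size=64)
```

The size-dimension trade-off the abstract mentions is visible here: for a fixed parameter budget K*D, enlarging K (more tokens) forces smaller D (coarser code vectors), and vice versa.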