ActionCodec: What Makes for Good Action Tokenizers

📅 2026-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing action tokenizers overly prioritize reconstruction fidelity while neglecting their direct impact on vision-language-action (VLA) model training, and lack design principles explicitly optimized for VLA tasks. This work proposes, for the first time, an information-theoretic framework for action tokenizer design grounded in VLA training objectives: maximizing temporal token overlap, minimizing vocabulary redundancy, and enhancing both multimodal mutual information and token independence. Based on these principles, the authors develop ActionCodec, an efficient tokenizer enabling end-to-end autoregressive VLA training. Integrated into SmolVLM2-2.2B, ActionCodec achieves a 95.5% success rate on the LIBERO benchmark without any robot pretraining; with architectural enhancements, performance further improves to 97.4%, a new state of the art among VLA models without robotics pre-training.

📝 Abstract
Vision-Language-Action (VLA) models leveraging the native autoregressive paradigm of Vision-Language Models (VLMs) have demonstrated superior instruction-following and training efficiency. Central to this paradigm is action tokenization, yet its design has primarily focused on reconstruction fidelity, failing to address its direct impact on VLA optimization. Consequently, the fundamental question of what makes for good action tokenizers remains unanswered. In this paper, we bridge this gap by establishing design principles specifically from the perspective of VLA optimization. We identify a set of best practices based on information-theoretic insights, including maximized temporal token overlap, minimized vocabulary redundancy, enhanced multimodal mutual information, and token independence. Guided by these principles, we introduce ActionCodec, a high-performance action tokenizer that significantly enhances both training efficiency and VLA performance across diverse simulation and real-world benchmarks. Notably, on LIBERO, a SmolVLM2-2.2B fine-tuned with ActionCodec achieves a 95.5% success rate without any robotics pre-training. With advanced architectural enhancements, this reaches 97.4%, representing a new SOTA for VLA models without robotics pre-training. We believe our established design principles, alongside the released model, will provide a clear roadmap for the community to develop more effective action tokenizers.
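To make two of the stated principles concrete, here is a minimal, hypothetical sketch of how "temporal token overlap" and "vocabulary redundancy" could be measured for a tokenizer. The per-dimension binning tokenizer, the metric definitions, and all function names below are illustrative assumptions, not ActionCodec's actual method: overlap is taken as the fraction of token positions that stay identical between consecutive shifted action chunks, and redundancy is proxied by the entropy of empirical token usage.

```python
import numpy as np

def tokenize(chunk, num_bins=256, low=-1.0, high=1.0):
    # Toy stand-in tokenizer: uniform per-dimension binning of a
    # (timesteps, dof) action chunk into integer tokens. ActionCodec
    # itself is a learned tokenizer; this is only for illustration.
    chunk = np.clip(chunk, low, high)
    return np.floor((chunk - low) / (high - low) * (num_bins - 1)).astype(int).ravel()

def temporal_token_overlap(tokens_t, tokens_t1):
    # Fraction of token positions unchanged between the chunks at
    # step t and t+1. Higher overlap means an autoregressive VLA
    # effectively re-predicts fewer tokens per control step.
    assert tokens_t.shape == tokens_t1.shape
    return float(np.mean(tokens_t == tokens_t1))

def vocabulary_usage_entropy(tokens, num_bins=256):
    # Entropy (bits) of the empirical token distribution. Usage far
    # below log2(num_bins) suggests a redundant vocabulary.
    counts = np.bincount(tokens, minlength=num_bins)
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Smooth synthetic 7-DoF trajectory; real data would come from demos.
rng = np.random.default_rng(0)
traj = np.cumsum(rng.normal(scale=0.01, size=(60, 7)), axis=0)
chunk_a = tokenize(traj[0:16])   # window [0, 16)
chunk_b = tokenize(traj[1:17])   # same window shifted by one step
print(temporal_token_overlap(chunk_a, chunk_b))
print(vocabulary_usage_entropy(np.concatenate([chunk_a, chunk_b])))
```

On smooth trajectories the overlap is typically high, which is the paper's motivation for tokenizers whose codes change slowly over time; a learned codec would additionally shape the codebook toward high-entropy usage.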
Problem

Research questions and friction points this paper is trying to address.

action tokenization
Vision-Language-Action models
VLA optimization
tokenizer design
reconstruction fidelity
Innovation

Methods, ideas, or system contributions that make the work stand out.

action tokenization
Vision-Language-Action models
information-theoretic design
ActionCodec
multimodal mutual information