ActionCodec: What Makes for Good Action Tokenizers

📅 2026-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing action tokenizers overly prioritize reconstruction fidelity while neglecting their direct impact on vision-language-action (VLA) model training, and lack design principles explicitly optimized for VLA tasks. This work proposes, for the first time, an information-theoretic framework for action tokenizer design grounded in VLA training objectives: maximizing temporal token overlap, minimizing vocabulary redundancy, and enhancing both multimodal mutual information and token independence. Based on these principles, the authors develop ActionCodec, an efficient tokenizer enabling end-to-end autoregressive VLA training. Integrated into SmolVLM2-2.2B, ActionCodec achieves a 95.5% success rate on the LIBERO benchmark without any robot pretraining; with architectural enhancements, performance further improves to 97.4%, a new state of the art among VLA models without robotics pre-training.

📝 Abstract
Vision-Language-Action (VLA) models leveraging the native autoregressive paradigm of Vision-Language Models (VLMs) have demonstrated superior instruction-following and training efficiency. Central to this paradigm is action tokenization, yet its design has primarily focused on reconstruction fidelity, failing to address its direct impact on VLA optimization. Consequently, the fundamental question of what makes for good action tokenizers remains unanswered. In this paper, we bridge this gap by establishing design principles specifically from the perspective of VLA optimization. We identify a set of best practices based on information-theoretic insights, including maximized temporal token overlap, minimized vocabulary redundancy, enhanced multimodal mutual information, and token independence. Guided by these principles, we introduce ActionCodec, a high-performance action tokenizer that significantly enhances both training efficiency and VLA performance across diverse simulation and real-world benchmarks. Notably, on LIBERO, a SmolVLM2-2.2B fine-tuned with ActionCodec achieves a 95.5% success rate without any robotics pre-training. With advanced architectural enhancements, this reaches 97.4%, representing a new SOTA for VLA models without robotics pre-training. We believe our established design principles, alongside the released model, will provide a clear roadmap for the community to develop more effective action tokenizers.
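To make two of the stated principles concrete, here is a minimal, hypothetical sketch of how "temporal token overlap" and "vocabulary redundancy" could be measured for a tokenizer. The per-dimension binning tokenizer, the metric definitions, and all function names below are illustrative assumptions, not ActionCodec's actual method: overlap is taken as the fraction of token positions that stay identical between consecutive shifted action chunks, and redundancy is proxied by the entropy of empirical token usage.

```python
import numpy as np

def tokenize(chunk, num_bins=256, low=-1.0, high=1.0):
    # Toy stand-in tokenizer: uniform per-dimension binning of a
    # (timesteps, dof) action chunk into integer tokens. ActionCodec
    # itself is a learned tokenizer; this is only for illustration.
    chunk = np.clip(chunk, low, high)
    return np.floor((chunk - low) / (high - low) * (num_bins - 1)).astype(int).ravel()

def temporal_token_overlap(tokens_t, tokens_t1):
    # Fraction of token positions unchanged between the chunks at
    # step t and t+1. Higher overlap means an autoregressive VLA
    # effectively re-predicts fewer tokens per control step.
    assert tokens_t.shape == tokens_t1.shape
    return float(np.mean(tokens_t == tokens_t1))

def vocabulary_usage_entropy(tokens, num_bins=256):
    # Entropy (bits) of the empirical token distribution. Usage far
    # below log2(num_bins) suggests a redundant vocabulary.
    counts = np.bincount(tokens, minlength=num_bins)
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Smooth synthetic 7-DoF trajectory; real data would come from demos.
rng = np.random.default_rng(0)
traj = np.cumsum(rng.normal(scale=0.01, size=(60, 7)), axis=0)
chunk_a = tokenize(traj[0:16])   # window [0, 16)
chunk_b = tokenize(traj[1:17])   # same window shifted by one step
print(temporal_token_overlap(chunk_a, chunk_b))
print(vocabulary_usage_entropy(np.concatenate([chunk_a, chunk_b])))
```

On smooth trajectories the overlap is typically high, which is the paper's motivation for tokenizers whose codes change slowly over time; a learned codec would additionally shape the codebook toward high-entropy usage.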
Problem

Research questions and friction points this paper is trying to address.

action tokenization
Vision-Language-Action models
VLA optimization
tokenizer design
reconstruction fidelity
Innovation

Methods, ideas, or system contributions that make the work stand out.

action tokenization
Vision-Language-Action models
information-theoretic design
ActionCodec
multimodal mutual information