VQ-VLA: Improving Vision-Language-Action Models via Scaling Vector-Quantized Action Tokenizers

📅 2025-07-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the poor action coherence and low real-world success rates of vision-language-action (VLA) models on long-horizon tasks. The authors propose a vector-quantized action tokenizer (VQ-Action Tokenizer), trained on large-scale synthetic and real-world action trajectory data. By incorporating spatiotemporal dynamic modeling and scalable representation learning, the method substantially narrows the domain gap between synthetic and real action trajectories, enabling efficient use of massive synthetic datasets. Key contributions include: (i) the first empirical verification that the domain gap between synthetic and real action trajectories is small enough to enable zero-shot cross-task transfer; (ii) improved long-horizon reasoning efficiency and action coherence; and (iii) up to 30% higher task success rates on real robotic platforms, demonstrating strong generalization in simulation and robust effectiveness in real-world deployment.
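
To make the core mechanism concrete, the sketch below shows a minimal vector-quantization layer of the kind such an action tokenizer builds on: a continuous action-chunk embedding is snapped to its nearest codebook entry, yielding a discrete token that a VLA backbone can predict. This is an illustrative PyTorch sketch under standard VQ-VAE assumptions, not the paper's released implementation; the class name, codebook size, and loss weights are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ActionVectorQuantizer(nn.Module):
    """Hypothetical minimal VQ layer: snaps a continuous action-chunk
    embedding to its nearest codebook entry (a discrete action token)."""

    def __init__(self, num_codes: int = 512, code_dim: int = 64, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment-loss weight, as in standard VQ-VAE

    def forward(self, z: torch.Tensor):
        # z: (batch, code_dim) encoder output, one action chunk per row.
        # Squared L2 distance from each embedding to every codebook vector.
        dists = (
            z.pow(2).sum(1, keepdim=True)
            - 2.0 * z @ self.codebook.weight.t()
            + self.codebook.weight.pow(2).sum(1)
        )
        tokens = dists.argmin(dim=1)       # (batch,) discrete action tokens
        z_q = self.codebook(tokens)        # (batch, code_dim) quantized vectors

        # Standard VQ-VAE objective: pull codes toward encoder outputs
        # (codebook loss) and the encoder toward its chosen code (commitment).
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())

        # Straight-through estimator: gradients pass through the
        # non-differentiable nearest-neighbor lookup as if it were identity.
        z_q = z + (z_q - z).detach()
        return z_q, tokens, loss
```

The straight-through estimator is what makes the codebook trainable end-to-end with the encoder despite the discrete lookup, which is the property that lets the tokenizer scale with more trajectory data.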

📝 Abstract
In this paper, we introduce an innovative vector-quantization-based action tokenizer built upon the largest-scale action trajectory dataset to date, leveraging over 100 times more data than previous approaches. This extensive dataset enables our tokenizer to capture rich spatiotemporal dynamics, resulting in a model that not only accelerates inference but also generates smoother and more coherent action outputs. Once trained, the tokenizer can be seamlessly adapted to a wide range of downstream tasks in a zero-shot manner, from short-horizon reactive behaviors to long-horizon planning. A key finding of our work is that the domain gap between synthetic and real action trajectories is marginal, allowing us to effectively utilize a vast amount of synthetic data during training without compromising real-world performance. To validate our approach, we conducted extensive experiments in both simulated environments and on real robotic platforms. The results demonstrate that as the volume of synthetic trajectory data increases, the performance of our tokenizer on downstream tasks improves significantly; most notably, it achieves up to a 30% higher success rate on two real-world tasks in long-horizon scenarios. These findings highlight the potential of our action tokenizer as a robust and scalable solution for real-time embodied intelligence systems, paving the way for more efficient and reliable robotic control in diverse application domains. Project website: https://xiaoxiao0406.github.io/vqvla.github.io
Problem

Research questions and friction points this paper is trying to address.

Scaling action tokenizers for vision-language-action models
Bridging domain gap between synthetic and real action data
Improving real-world robotic task performance via synthetic data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vector quantization for action tokenizer scaling
Leveraging large-scale synthetic trajectory data
Zero-shot adaptation to diverse downstream tasks (usage sketch below)
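
As a usage sketch for the zero-shot point above, a tokenizer frozen after pretraining can discretize new action chunks without task-specific fine-tuning. This builds on the hypothetical ActionVectorQuantizer defined earlier; all shapes and names are illustrative assumptions, not the paper's API.

```python
import torch

# Hypothetical zero-shot usage: the pretrained quantizer is frozen and
# reused on a new downstream task without any fine-tuning.
quantizer = ActionVectorQuantizer(num_codes=512, code_dim=64).eval()
for p in quantizer.parameters():
    p.requires_grad_(False)

with torch.no_grad():
    z = torch.randn(8, 64)            # encoder embeddings for 8 action chunks
    z_q, tokens, _ = quantizer(z)     # tokens: (8,) discrete action-token IDs

# `tokens` can now serve as prediction targets for a downstream VLA policy.
```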