From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models

📅 2026-05-06

📝 Abstract
Latent actions serve as an intermediate representation that enables consistent modeling of vision-language-action (VLA) models across heterogeneous datasets. However, approaches to supervising VLAs with latent actions are fragmented and lack a systematic comparison. This work structures the study of latent action supervision from two perspectives: (i) regularizing the trajectory via image-based latent actions, and (ii) unifying the target space with action-based latent actions. Under a unified VLA baseline, we instantiate and compare four representative integration strategies. Our results reveal a formulation-task correspondence: image-based latent actions benefit long-horizon reasoning and scene-level generalization, whereas action-based latent actions excel at complex motor coordination. Furthermore, we find that directly supervising the VLM with discrete latent action tokens yields the best performance. Finally, our experiments offer initial insights into the benefits of latent action supervision in mixed-data settings, suggesting a promising direction for VLA training. Code is available at https://github.com/RUCKBReasoning/From_Pixels_to_Tokens.
Problem

Research questions and friction points this paper is trying to address.

latent actions
vision-language-action models
action supervision
heterogeneous datasets
intermediate representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

latent actions
vision-language-action models
discrete tokens
action supervision
heterogeneous datasets
🔎 Similar Papers
Yihan Lin
Assistant Professor, Xiamen University
Brain-inspired vision, deep learning, neuromorphic engineering, complex networks
Haoyang Li
School of Information, Renmin University of China, Beijing, China; Key Laboratory of Data Engineering and Knowledge Engineering, Beijing, China
Yang Li
Renmin University of China
Haitao Shen
School of Information, Renmin University of China, Beijing, China; Key Laboratory of Data Engineering and Knowledge Engineering, Beijing, China
Yihan Zhao
School of Information, Renmin University of China, Beijing, China; Key Laboratory of Data Engineering and Knowledge Engineering, Beijing, China
Chao Shao
School of Information, Renmin University of China, Beijing, China; Key Laboratory of Data Engineering and Knowledge Engineering, Beijing, China
Jing Zhang
Renmin University of China
Large model alignment, model compression & inference optimization, data intelligence