From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models

📅 2026-05-06

📈 Citations: 0

✨ Influential: 0

📝 Abstract

Latent actions serve as an intermediate representation that enables consistent modeling of vision-language-action (VLA) models across heterogeneous datasets. However, approaches to supervising VLAs with latent actions are fragmented and lack a systematic comparison. This work structures the study of latent action supervision from two perspectives: (i) regularizing the trajectory via image-based latent actions, and (ii) unifying the target space with action-based latent actions. Under a unified VLA baseline, we instantiate and compare four representative integration strategies. Our results reveal a formulation-task correspondence: image-based latent actions benefit long-horizon reasoning and scene-level generalization, whereas action-based latent actions excel at complex motor coordination. Furthermore, we find that directly supervising the VLM with discrete latent action tokens yields the strongest performance. Finally, our experiments offer initial insights into the benefits of latent action supervision in mixed-data settings, suggesting a promising direction for VLA training. Code is available at https://github.com/RUCKBReasoning/From_Pixels_to_Tokens.

Problem

Research questions and friction points this paper is trying to address.

latent actions

vision-language-action models

action supervision

heterogeneous datasets

intermediate representation

Innovation

Methods, ideas, or system contributions that make the work stand out.

latent actions

vision-language-action models

discrete tokens