🤖 AI Summary
To address key challenges in Vision-Language-Action (VLA) models—including insufficient modeling of scene details in complex spatial environments, a significant modality gap between visual perception and low-level actions, and misalignment between visual prediction and action generation objectives that leads to training instability—this paper proposes the Vision-Integrated Trajectory Alignment (VITA) framework, which bridges the vision and action modalities. VITA introduces: (1) a shared discrete latent space unifying visual observations and primitive actions; (2) an implicit visual Chain-of-Thought mechanism that internalizes visual dynamics as an inductive bias for motion planning; and (3) joint optimization of autoregressive token generation, future-frame prediction, and action decoding. Evaluated on the CALVIN, LIBERO, and SimplerEnv benchmarks, VITA achieves absolute improvements of 14.5%, 9.6%, and 12.1%, respectively, and an average success rate of 80.5% across six real-world tasks—substantially outperforming state-of-the-art methods.
📝 Abstract
Vision-Language-Action (VLA) models built upon Chain-of-Thought (CoT) have achieved remarkable success in advancing general-purpose robotic agents, owing to their strong perceptual comprehension. Because text-only CoT struggles to adequately capture scene details in complex spatial environments, a highly promising recent strategy is to leverage visual priors to guide robotic action generation. Nevertheless, such strategies face two inherent challenges: (i) a modality gap between visual observations and low-level actions, and (ii) unstable training due to competing objectives between visual prediction and action generation. To address these challenges, we propose the Vision-Integrated Trajectory Alignment (VITA) framework, which learns a shared discrete latent space for vision and action, enabling joint modeling of perception and motor control. VITA introduces an implicit visual CoT: autoregressively generated tokens are simultaneously decoded into future-frame predictions and robot actions, thereby internalizing visual dynamics as an inductive bias for motion planning. Extensive experiments in simulated and real-world environments demonstrate state-of-the-art performance. VITA improves upon existing baselines by 14.5%, 9.6%, and 12.1% on CALVIN, LIBERO, and SimplerEnv, respectively. Furthermore, VITA attains an average success rate of 80.5% across six real-world tasks, demonstrating its potential as a generalist robotic manipulation model.
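The two core ideas in the abstract—a shared discrete latent space for vision and action, and one token stream supervised by both future-frame prediction and action decoding—can be illustrated with a minimal sketch. This is not the paper's implementation: the codebook size, embedding dimension, loss weighting, and all function names below are assumptions chosen for illustration.

```python
# Illustrative sketch only: shapes, names, and the 0.5 loss weight are
# assumptions for exposition, not VITA's actual architecture.
import numpy as np

rng = np.random.default_rng(0)

# A single VQ-style codebook shared by both modalities: vision and action
# embeddings are snapped to the SAME set of discrete latent codes.
CODEBOOK_SIZE, DIM = 16, 8
codebook = rng.normal(size=(CODEBOOK_SIZE, DIM))

def quantize(x: np.ndarray) -> np.ndarray:
    """Map continuous embeddings (N, DIM) to nearest-codeword indices (N,)."""
    dists = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
    return dists.argmin(axis=1)

# Vision and action features land in one discrete vocabulary, so a single
# autoregressive model can emit tokens covering both modalities.
vision_tokens = quantize(rng.normal(size=(4, DIM)))
action_tokens = quantize(rng.normal(size=(4, DIM)))

def joint_loss(frame_pred, frame_target, action_pred, action_target, w_frame=0.5):
    """Joint objective: the same token stream is decoded into future frames
    and robot actions, and both decoding heads are optimized together."""
    frame_loss = ((frame_pred - frame_target) ** 2).mean()
    action_loss = ((action_pred - action_target) ** 2).mean()
    return w_frame * frame_loss + (1.0 - w_frame) * action_loss
```

Sharing one codebook is what closes the modality gap in this sketch: both decoding heads read from a common discrete vocabulary, so the gradient of either objective shapes a latent space the other can also use.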