🤖 AI Summary
Existing vision-language-action (VLA) models rely on pretraining with discrete, static image data, limiting their ability to capture physical dynamics and temporal dependencies—thus necessitating large-scale expert demonstration trajectories. To address this, we propose a video-driven end-to-end control paradigm that, for the first time, integrates internet-scale video foundation models into robotic action modeling, yielding the Video-Action Model (VAM). We introduce a flow-matching-based action decoder that serves as an explicit inverse-dynamics model encoding physical causality, decoupling semantic planning from motion control in latent space. Evaluated on both simulation and real-robot benchmarks, VAM achieves state-of-the-art performance, improves sample efficiency by 10×, and accelerates convergence by 2× compared to prior methods.
📝 Abstract
Prevailing Vision-Language-Action Models (VLAs) for robotic manipulation are built upon vision-language backbones pretrained on large-scale but disconnected, static web data. As a result, despite improved semantic generalization, the policy must implicitly infer complex physical dynamics and temporal dependencies solely from robot trajectories. This reliance creates an unsustainable data burden, necessitating continuous, large-scale expert data collection to compensate for the lack of innate physical understanding. We contend that while vision-language pretraining effectively captures semantic priors, it remains blind to physical causality. A more effective paradigm leverages video to jointly capture semantics and visual dynamics during pretraining, thereby isolating the remaining task of low-level control. To this end, we introduce the Video-Action Model (VAM), a novel architecture that pairs a pretrained Internet-scale video model with a flow-matching-based action decoder conditioned on its latent representations. The decoder serves as an Inverse Dynamics Model (IDM), generating low-level robot actions from the latent representation of video-space action plans. Our extensive evaluation shows that our approach achieves state-of-the-art performance on simulated and real-world robotic manipulation tasks, improving sample efficiency by 10x and convergence speed by 2x compared to traditional VLA architectures.
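To make the architecture concrete, here is a minimal sketch of a flow-matching action decoder acting as an inverse dynamics model conditioned on video-model latents. All class names, dimensions, and the linear (rectified-flow style) interpolant are illustrative assumptions, not the paper's actual implementation; the real VAM conditions on latents produced by a pretrained video foundation model, which is abstracted here as a plain tensor `z`.

```python
import torch
import torch.nn as nn


class FlowMatchingActionDecoder(nn.Module):
    """Hypothetical sketch: learns a velocity field v(a_t, t | z) that
    transports noise into low-level robot actions, conditioned on a
    latent video-space plan z (the IDM role described in the abstract)."""

    def __init__(self, latent_dim=512, action_dim=7, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim + latent_dim + 1, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, a_t, t, z):
        # a_t: (B, action_dim) noisy action; t: (B, 1) flow time in [0, 1];
        # z: (B, latent_dim) latent from the (frozen) video model.
        return self.net(torch.cat([a_t, z, t], dim=-1))


def flow_matching_loss(decoder, actions, z):
    """Flow-matching objective with a linear interpolant between noise
    and the expert action; the target velocity is then constant."""
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1)
    a_t = (1 - t) * noise + t * actions   # interpolate noise -> action
    target_v = actions - noise            # velocity of the interpolant
    pred_v = decoder(a_t, t, z)
    return ((pred_v - target_v) ** 2).mean()


@torch.no_grad()
def sample_actions(decoder, z, action_dim=7, steps=10):
    """Euler integration of the learned ODE from noise to an action."""
    a = torch.randn(z.shape[0], action_dim)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((z.shape[0], 1), i * dt)
        a = a + dt * decoder(a, t, z)
    return a
```

At inference time, the video model summarizes the scene and instruction into `z`, and a handful of Euler steps through the decoder yields an executable action, which is how the decoupling of semantic planning (video latent) from motion control (action decoding) plays out in practice.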