AI Summary
Existing approaches to robot policy learning fail to effectively leverage the implicit, cross-frame, physically consistent motion representations encoded in video diffusion models (VDMs). To address this, we propose Video2Act, a novel framework that, for the first time, extracts foreground boundaries and inter-frame motion dynamics from intermediate VDM features, enabling a dual-path asynchronous inference architecture: a spatially aware "slow system" and a motion-aware "fast system." We further introduce a diffusion transformer (DiT)-based action head for end-to-end action generation. Our method explicitly incorporates VDM-derived physical priors into policy learning, balancing perceptual stability with decision-making latency. Evaluated on both simulation and real-world robotic tasks, Video2Act achieves average success rates 7.7% and 21.7% higher, respectively, than state-of-the-art vision-language-action (VLA) methods, demonstrating significantly improved generalization and deployment robustness.
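To make the feature-extraction idea concrete, here is a minimal PyTorch sketch of how spatial and motion conditions might be derived from intermediate VDM activations. The function name, the saliency-threshold foreground proxy, and the frame-difference motion proxy are illustrative assumptions, not the actual extraction procedure used by Video2Act.

```python
import torch

def refine_vdm_features(feats: torch.Tensor) -> dict:
    """Hypothetical sketch: derive spatial and motion conditions from
    intermediate VDM activations of shape (T, C, H, W)."""
    # Spatial path: per-frame activation magnitude as a rough foreground
    # proxy; weakly activated background regions are masked out.
    saliency = feats.norm(dim=1, keepdim=True)        # (T, 1, H, W)
    spatial = (saliency > saliency.mean()).float() * feats

    # Motion path: inter-frame feature differences capture cross-frame
    # dynamics; static, task-irrelevant background largely cancels out.
    motion = feats[1:] - feats[:-1]                   # (T-1, C, H, W)
    return {"spatial": spatial, "motion": motion}
```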
Abstract
Robust perception and dynamics modeling are fundamental to real-world robotic policy learning. Recent methods employ video diffusion models (VDMs) to enhance robotic policies, improving their understanding and modeling of the physical world. However, existing approaches overlook the coherent and physically consistent motion representations inherently encoded across frames in VDMs. To this end, we propose Video2Act, a framework that efficiently guides robotic action learning by explicitly integrating spatial and motion-aware representations. Building on the inherent representations of VDMs, we extract foreground boundaries and inter-frame motion variations while filtering out background noise and task-irrelevant biases. These refined representations are then used as additional conditioning inputs to a diffusion transformer (DiT) action head, enabling it to reason about what to manipulate and how to move. To mitigate inference inefficiency, we propose an asynchronous dual-system design, in which the VDM functions as the slow System 2 and the DiT head as the fast System 1, working collaboratively to generate adaptive actions. By providing motion-aware conditions to System 1, Video2Act maintains stable manipulation even with low-frequency updates from the VDM. In evaluation, Video2Act surpasses previous state-of-the-art vision-language-action (VLA) methods in average success rate by 7.7% in simulation and 21.7% in real-world tasks, further exhibiting strong generalization capabilities.
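The asynchronous dual-system design can be pictured as two loops sharing a conditioning buffer: a slow loop that refreshes spatial/motion conditions whenever a VDM pass finishes, and a fast loop that samples actions at the control rate from whatever conditions are currently available. The sketch below illustrates this pattern; every class and function name is a hypothetical placeholder, not the released implementation.

```python
import threading
import time

# All classes and functions below are hypothetical stand-ins for the
# paper's components, shown only to illustrate the asynchronous pattern.

class VideoDiffusionModel:
    def features(self, obs):
        """Intermediate VDM features for the latest observation (slow)."""
        time.sleep(0.5)           # simulate an expensive denoising pass
        return obs

class DiTActionHead:
    def sample_action(self, obs, cond):
        """Fast DiT head: denoise an action conditioned on refined features."""
        return [0.0] * 7          # placeholder 7-DoF action

latest_cond = {"spatial": None, "motion": None}   # shared conditioning buffer
lock = threading.Lock()

def system2_loop(vdm, get_obs, refine):
    """Slow System 2: low-frequency spatial/motion condition updates."""
    while True:
        cond = refine(vdm.features(get_obs()))    # e.g. refine_vdm_features above
        with lock:
            latest_cond.update(cond)

def system1_loop(head, get_obs, send_action, hz=30):
    """Fast System 1: acts at the control rate using the most recent
    conditions, keeping manipulation stable between VDM updates."""
    while True:
        with lock:
            cond = dict(latest_cond)
        send_action(head.sample_action(get_obs(), cond))
        time.sleep(1.0 / hz)
```

Running each loop in its own thread means System 1 never blocks on the VDM: it keeps acting on the most recent motion-aware conditions, which is what allows stable manipulation under low-frequency System 2 updates.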