Video2Act: A Dual-System Video Diffusion Policy with Robotic Spatio-Motional Modeling

๐Ÿ“… 2025-12-02
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Existing video diffusion models (VDMs) fail to effectively leverage their implicit cross-frame physically consistent motion representations for robot policy learning. To address this, we propose Video2Actโ€”a novel framework that, for the first time, extracts foreground boundaries and inter-frame motion dynamics from intermediate VDM features, enabling a dual-path asynchronous inference architecture: a spatially aware โ€œslow systemโ€ and a motion-aware โ€œfast system.โ€ We further introduce a diffusion Transformer-based action head for end-to-end action generation. Our method explicitly incorporates VDM-derived physical priors into policy learning, jointly optimizing perceptual stability and decision-making latency. Evaluated on both simulation and real-world robotic tasks, Video2Act achieves average success rates 7.7% and 21.7% higher than state-of-the-art vision-language-action (VLA) methods, respectively, demonstrating significantly improved generalization and deployment robustness.
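As a rough illustration of the extraction step described above, the sketch below derives an inter-frame motion map and a coarse foreground mask from intermediate VDM feature maps. The function name, the (T, C, H, W) feature layout, and the quantile-based thresholding are illustrative assumptions for this summary, not the paper's exact procedure.

```python
import torch

def motion_and_foreground_conditions(features: torch.Tensor,
                                      fg_quantile: float = 0.8):
    """Derive simple spatial/motion conditions from intermediate VDM features.

    features: (T, C, H, W) feature maps taken from a video diffusion model.
    Returns a per-transition motion map and a coarse per-frame foreground mask.
    Illustrative heuristic only, not the paper's exact extraction.
    """
    # Inter-frame feature differences as a motion cue: (T-1, H, W).
    motion = (features[1:] - features[:-1]).norm(dim=1)

    # Coarse foreground estimate: locations whose feature activation is in
    # the top (1 - fg_quantile) fraction of each frame.
    activation = features.norm(dim=1)                      # (T, H, W)
    thresh = torch.quantile(activation.flatten(1), fg_quantile, dim=1)
    foreground = activation > thresh[:, None, None]        # (T, H, W) bool

    # Suppress background motion so the condition focuses on the manipulated
    # foreground rather than task-irrelevant regions.
    motion = motion * foreground[:-1].float()
    return motion, foreground
```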

๐Ÿ“ Abstract
Robust perception and dynamics modeling are fundamental to real-world robotic policy learning. Recent methods employ video diffusion models (VDMs) to enhance robotic policies, improving their understanding and modeling of the physical world. However, existing approaches overlook the coherent and physically consistent motion representations inherently encoded across frames in VDMs. To address this, we propose Video2Act, a framework that efficiently guides robotic action learning by explicitly integrating spatial and motion-aware representations. Building on the inherent representations of VDMs, we extract foreground boundaries and inter-frame motion variations while filtering out background noise and task-irrelevant biases. These refined representations are then used as additional conditioning inputs to a diffusion transformer (DiT) action head, enabling it to reason about what to manipulate and how to move. To mitigate inference inefficiency, we propose an asynchronous dual-system design, where the VDM functions as the slow System 2 and the DiT head as the fast System 1, working collaboratively to generate adaptive actions. By providing motion-aware conditions to System 1, Video2Act maintains stable manipulation even with low-frequency updates from the VDM. In evaluations, Video2Act surpasses previous state-of-the-art VLA methods by 7.7% in simulation and 21.7% in real-world tasks in average success rate, and further exhibits strong generalization capabilities.
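The asynchronous dual-system rollout can be pictured with the minimal Python sketch below. `DualSystemPolicy`, `extract_conditions`, and `sample` are hypothetical names chosen for illustration (no official implementation is referenced here); the point is only that System 1 keeps acting on the most recent cached conditions while System 2 refreshes them at a lower frequency.

```python
import threading
import time

import torch

class DualSystemPolicy:
    """Slow System 2 (the VDM) refreshes spatial/motion conditions at low
    frequency; fast System 1 (the DiT action head) generates actions at
    control rate using the most recent cached conditions."""

    def __init__(self, vdm, action_head, vdm_period_s=1.0):
        self.vdm = vdm                    # slow, video-diffusion System 2
        self.action_head = action_head    # fast, DiT-based System 1
        self.vdm_period_s = vdm_period_s
        self.latest_conditions = None     # last spatial/motion features
        self._lock = threading.Lock()

    def start(self, get_observation):
        # Launch the low-frequency System 2 loop in the background.
        threading.Thread(target=self._system2_loop,
                         args=(get_observation,), daemon=True).start()

    def _system2_loop(self, get_observation):
        # Periodically run the VDM and cache its intermediate features
        # as conditioning inputs for System 1 (assumed method name).
        while True:
            obs = get_observation()
            with torch.no_grad():
                spatial, motion = self.vdm.extract_conditions(obs)
            with self._lock:
                self.latest_conditions = (spatial, motion)
            time.sleep(self.vdm_period_s)

    def act(self, obs):
        # High-frequency call: denoise an action chunk conditioned on the
        # newest cached VDM features, even if they are a few steps stale.
        with self._lock:
            conditions = self.latest_conditions
        with torch.no_grad():
            return self.action_head.sample(obs, conditions)
```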
Problem

Research questions and friction points this paper is trying to address.

How to exploit the coherent, physically consistent motion representations implicitly encoded across frames in video diffusion models for robot policy learning
How to integrate spatial and motion-aware representations into action learning while filtering out background noise and task-irrelevant biases
How to keep manipulation efficient and adaptive despite the high inference cost of video diffusion models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extracts foreground boundaries and motion variations from video diffusion models
Uses diffusion transformer action head with refined spatial-motion conditioning
Implements asynchronous dual-system design for efficient adaptive action generation
๐Ÿ”Ž Similar Papers
No similar papers found.
Authors

Yueru Jia
School of Computer Science, Peking University
Robotics · AIGC · Computer Vision
Jiaming Liu
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Shengbang Liu
Sun Yat-sen University
Rui Zhou
Wuhan University
Wanhe Yu
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Yuyang Yan
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Xiaowei Chi
The Hong Kong University of Science and Technology
Multimodal Generation · Robotics · Computer Vision
Yandong Guo
AI2Robotics
Boxin Shi
Peking University
Computer Vision · Computational Photography
Shanghang Zhang
Peking University
Embodied AI · Foundation Models