AI Summary
Existing approaches to robot policy learning fail to effectively leverage the implicit, cross-frame, physically consistent motion representations encoded in video diffusion models (VDMs). To address this, we propose Video2Act, a novel framework that, for the first time, extracts foreground boundaries and inter-frame motion dynamics from intermediate VDM features, enabling a dual-path asynchronous inference architecture: a spatially aware "slow system" and a motion-aware "fast system." We further introduce a diffusion transformer (DiT)-based action head for end-to-end action generation. Our method explicitly incorporates VDM-derived physical priors into policy learning, balancing perceptual stability with decision-making latency. Evaluated on both simulation and real-world robotic tasks, Video2Act achieves average success rates 7.7% and 21.7% higher, respectively, than state-of-the-art vision-language-action (VLA) methods, demonstrating significantly improved generalization and deployment robustness.
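To make the feature-extraction idea concrete, here is a minimal PyTorch sketch of how spatial and motion conditions might be derived from intermediate VDM activations. The function name, the saliency-threshold foreground proxy, and the frame-difference motion proxy are illustrative assumptions, not the actual extraction procedure used by Video2Act.

```python
import torch

def refine_vdm_features(feats: torch.Tensor) -> dict:
    """Hypothetical sketch: derive spatial and motion conditions from
    intermediate VDM activations of shape (T, C, H, W)."""
    # Spatial path: per-frame activation magnitude as a rough foreground
    # proxy; weakly activated background regions are masked out.
    saliency = feats.norm(dim=1, keepdim=True)        # (T, 1, H, W)
    spatial = (saliency > saliency.mean()).float() * feats

    # Motion path: inter-frame feature differences capture cross-frame
    # dynamics; static, task-irrelevant background largely cancels out.
    motion = feats[1:] - feats[:-1]                   # (T-1, C, H, W)
    return {"spatial": spatial, "motion": motion}
```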
Abstract
Robust perception and dynamics modeling are fundamental to real-world robotic policy learning. Recent methods employ video diffusion models (VDMs) to enhance robotic policies, improving their understanding and modeling of the physical world. However, existing approaches overlook the coherent and physically consistent motion representations inherently encoded across frames in VDMs. To this end, we propose Video2Act, a framework that efficiently guides robotic action learning by explicitly integrating spatial and motion-aware representations. Building on the inherent representations of VDMs, we extract foreground boundaries and inter-frame motion variations while filtering out background noise and task-irrelevant biases. These refined representations are then used as additional conditioning inputs to a diffusion transformer (DiT) action head, enabling it to reason about what to manipulate and how to move. To mitigate inference inefficiency, we propose an asynchronous dual-system design, in which the VDM functions as the slow System 2 and the DiT head as the fast System 1, working collaboratively to generate adaptive actions. By providing motion-aware conditions to System 1, Video2Act maintains stable manipulation even with low-frequency updates from the VDM. In evaluation, Video2Act surpasses previous state-of-the-art vision-language-action (VLA) methods in average success rate by 7.7% in simulation and 21.7% in real-world tasks, further exhibiting strong generalization capabilities.
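The asynchronous dual-system design can be pictured as two loops sharing a conditioning buffer: a slow loop that refreshes spatial/motion conditions whenever a VDM pass finishes, and a fast loop that samples actions at the control rate from whatever conditions are currently available. The sketch below illustrates this pattern; every class and function name is a hypothetical placeholder, not the released implementation.

```python
import threading
import time

# All classes and functions below are hypothetical stand-ins for the
# paper's components, shown only to illustrate the asynchronous pattern.

class VideoDiffusionModel:
    def features(self, obs):
        """Intermediate VDM features for the latest observation (slow)."""
        time.sleep(0.5)           # simulate an expensive denoising pass
        return obs

class DiTActionHead:
    def sample_action(self, obs, cond):
        """Fast DiT head: denoise an action conditioned on refined features."""
        return [0.0] * 7          # placeholder 7-DoF action

latest_cond = {"spatial": None, "motion": None}   # shared conditioning buffer
lock = threading.Lock()

def system2_loop(vdm, get_obs, refine):
    """Slow System 2: low-frequency spatial/motion condition updates."""
    while True:
        cond = refine(vdm.features(get_obs()))    # e.g. refine_vdm_features above
        with lock:
            latest_cond.update(cond)

def system1_loop(head, get_obs, send_action, hz=30):
    """Fast System 1: acts at the control rate using the most recent
    conditions, keeping manipulation stable between VDM updates."""
    while True:
        with lock:
            cond = dict(latest_cond)
        send_action(head.sample_action(get_obs(), cond))
        time.sleep(1.0 / hz)
```

Running each loop in its own thread means System 1 never blocks on the VDM: it keeps acting on the most recent motion-aware conditions, which is what allows stable manipulation under low-frequency System 2 updates.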