Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model

📅 2026-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing robotic policies struggle to jointly model the 3D spatial structure and temporal dynamics of environments, often relying on 2D vision and static pretraining, which leads to poor data efficiency and limited generalization. This work proposes Multi-View Video Diffusion Policy (MV-VDP), the first approach to integrate multi-view video diffusion into policy learning. MV-VDP employs a unified 3D spatiotemporal architecture that simultaneously predicts multi-view affordance heatmaps and RGB video sequences, thereby co-modeling actions and their resulting environmental dynamics. Requiring no additional pretraining and enabling end-to-end learning, the method achieves state-of-the-art performance using only ten demonstration trajectories on both Meta-World benchmarks and real-world robotic tasks. It significantly outperforms existing approaches in data efficiency, robustness, out-of-distribution generalization, and future video prediction quality.
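The summary states that MV-VDP predicts multi-view affordance heatmaps alongside RGB video, but does not spell out how an action is decoded from those heatmaps. As a rough illustration of one common recipe (an assumption for illustration, not the paper's method), the sketch below takes the peak of each view's heatmap and triangulates the peaks into a 3D end-effector target using known camera projection matrices; the function names are hypothetical.

```python
# Illustrative sketch only: decoding a 3D target point from multi-view heatmaps
# via per-view argmax + linear (DLT) triangulation. Not taken from the paper.
import numpy as np

def heatmap_peak(heatmap: np.ndarray) -> np.ndarray:
    """Return the (u, v) pixel location of the heatmap maximum."""
    v, u = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return np.array([u, v], dtype=np.float64)

def triangulate(pixels, projections) -> np.ndarray:
    """Linear (DLT) triangulation of a single 3D point from two or more views.

    pixels:      per-view (u, v) peaks.
    projections: per-view 3x4 camera projection matrices.
    """
    rows = []
    for (u, v), P in zip(pixels, projections):
        rows.append(u * P[2] - P[0])   # each view contributes two linear constraints
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    _, _, vt = np.linalg.svd(A)        # null-space vector of A is the homogeneous 3D point
    X = vt[-1]
    return X[:3] / X[3]

# Usage: given per-view heatmaps `H` (list of HxW arrays) and projection matrices `Ps`,
# a 3D end-effector target could be recovered as:
#   target_3d = triangulate([heatmap_peak(h) for h in H], Ps)
```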
📝 Abstract
Robotic manipulation requires understanding both the 3D spatial structure of the environment and its temporal evolution, yet most existing policies overlook one or both. They typically rely on 2D visual observations and backbones pretrained on static image-text pairs, resulting in high data requirements and limited understanding of environment dynamics. To address this, we introduce MV-VDP, a multi-view video diffusion policy that jointly models the 3D spatio-temporal state of the environment. The core idea is to simultaneously predict multi-view heatmap videos and RGB videos, which 1) align the representation format of video pretraining with action finetuning, and 2) specify not only what actions the robot should take, but also how the environment is expected to evolve in response to those actions. Extensive experiments show that MV-VDP enables data-efficient, robust, generalizable, and interpretable manipulation. With only ten demonstration trajectories and without additional pretraining, MV-VDP successfully performs complex real-world tasks, demonstrates strong robustness across a range of model hyperparameters, generalizes to out-of-distribution settings, and predicts realistic future videos. Experiments on Meta-World and real-world robotic platforms demonstrate that MV-VDP consistently outperforms video-prediction-based, 3D-based, and vision-language-action models, establishing a new state of the art in data-efficient multi-task manipulation.
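The abstract's core mechanism, jointly denoising multi-view heatmap and RGB videos with one diffusion model, can be pictured with a generic DDPM-style training step. The sketch below is an assumption-laden illustration, not the authors' implementation: the `denoiser` network, its conditioning interface, the channel-wise concatenation of the two streams, and the noise schedule are all placeholders.

```python
# Minimal sketch, assuming a standard epsilon-prediction DDPM objective over
# multi-view video tensors; everything below is illustrative, not the paper's code.
import torch
import torch.nn.functional as F

def joint_diffusion_loss(denoiser, rgb, heat, cond, alphas_cumprod):
    """One DDPM-style training step on concatenated multi-view video targets.

    rgb:            (B, V, T, 3, H, W) multi-view RGB clips.
    heat:           (B, V, T, 1, H, W) multi-view affordance-heatmap clips.
    cond:           task conditioning, e.g. a language embedding of shape (B, D).
    alphas_cumprod: (N,) cumulative product of the noise schedule.
    """
    x0 = torch.cat([rgb, heat], dim=3)                    # joint target: (B, V, T, 4, H, W)
    t = torch.randint(0, alphas_cumprod.numel(), (x0.size(0),), device=x0.device)
    a = alphas_cumprod[t].view(-1, 1, 1, 1, 1, 1)         # broadcast over view/time/space
    noise = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * noise        # forward (noising) process
    pred = denoiser(x_t, t, cond)                         # network predicts the added noise
    return F.mse_loss(pred, noise)
```

Concatenating the heatmap and RGB channels is only one way to realize "simultaneous prediction"; the paper could equally use separate decoding heads or cross-attention between the two streams.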
Problem

Research questions and friction points this paper is trying to address.

robotic manipulation
3D spatio-temporal understanding
video diffusion policy
multi-view perception
environment dynamics
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-view video diffusion
3D spatio-temporal modeling
action-conditioned video prediction
data-efficient robotic manipulation
heatmaps for policy learning
Authors
Peiyan Li (Ludwig-Maximilians-Universität München): data mining, graph mining
Yixiang Chen (New Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences)
Yuan Xu (Associate Professor, Cumming School of Medicine, University of Calgary): Health Data Methods, Epidemiology, Health Services Research
Jiabing Yang (NLPR, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences)
Xiangnan Wu (NLPR, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences)
Jun Guo (UPenn, 2020–; previously Tsinghua, 2016–2020): Pattern Recognition, Machine Learning
Nan Sun (University of New South Wales): Cybersecurity, Artificial Intelligence Applications
Long Qian (Xi’an Jiaotong University)
Xinghang Li (Beijing Academy of Artificial Intelligence; Tsinghua University): Computer Vision, Robot Navigation, Manipulation
Xin Xiao (ByteDance Research): VLA, VLM
Jing Liu (FiveAges)
Nianfeng Liu (FiveAges)
Tao Kong (ByteDance Research): Robot Foundation Model, Robot Learning, Computer Vision
Yan Huang (Institute of Automation, Chinese Academy of Sciences): computer vision, deep learning, multimodal learning
Liang Wang (National Lab of Pattern Recognition): Computer Vision, Pattern Recognition, Machine Learning
Tieniu Tan (Institute of Automation, Chinese Academy of Sciences)