Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

πŸ“… 2024-12-19
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 4
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing vision encoders predominantly rely on single-image reconstruction or pairwise image contrastive learning, emphasizing static representations while struggling to capture task-critical dynamic temporal information. To address this, we propose leveraging video diffusion models (VDMs) to generate discriminative visual representations that jointly encode static scene features and predictive motion dynamicsβ€”and, for the first time, directly employ such representations for robotic policy learning. Methodologically, we introduce a conditional implicit inverse dynamics modeling framework wherein VDM-derived representations serve as input to drive embodied action decisions; we further enhance generalization via multi-source fine-tuning using both robot manipulation data and Internet-sourced human hand manipulation videos. On the CALVIN ABC-D generalization benchmark, our approach achieves an 18.6% relative improvement; in real-world dexterous manipulation tasks, success rates increase by 31.6%. These results demonstrate the efficacy and strong generalization capability of predictive visual representations for universal robotic policy learning.

πŸ“ Abstract
Visual representations play a crucial role in developing generalist robotic policies. Previous vision encoders, typically pre-trained with single-image reconstruction or two-image contrastive learning, tend to capture static information, often neglecting the dynamic aspects vital for embodied tasks. Recently, video diffusion models (VDMs) have demonstrated the ability to predict future frames, showcasing a strong understanding of the physical world. We hypothesize that VDMs inherently produce visual representations that encompass both current static information and predicted future dynamics, thereby providing valuable guidance for robot action learning. Based on this hypothesis, we propose the Video Prediction Policy (VPP), which learns an implicit inverse dynamics model conditioned on the predicted future representations inside VDMs. To predict the future more precisely, we fine-tune a pre-trained video foundation model on robot datasets along with internet human manipulation data. In experiments, VPP achieves an 18.6% relative improvement on the CALVIN ABC-D generalization benchmark compared to the previous state-of-the-art, and demonstrates a 31.6% increase in success rates for complex real-world dexterous manipulation tasks. Project page: https://video-prediction-policy.github.io
Problem

Research questions and friction points this paper is trying to address.

Enhancing robot policies with predictive visual representations
Addressing static vision encoders' neglect of dynamic task aspects
Improving robot action learning via video diffusion models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses video diffusion models for future frame prediction
Fine-tunes pre-trained video models with robot data
Learns inverse dynamics from predicted representations
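The pipeline behind these bullets — a video prediction model supplies a representation encoding both the current scene and forecast future dynamics, and an implicit inverse dynamics model maps that representation to actions without decoding future pixels — can be sketched in a toy form. All names and the simplified 1-D "dynamics" below are illustrative assumptions, not the paper's actual implementation.

```python
# Toy sketch of the VPP idea (illustrative, not the paper's code):
# 1) a stand-in "video model" turns an observation history into a predictive
#    representation (current frames + crudely forecast future states);
# 2) an implicit inverse dynamics model maps that representation to an action.

def video_model_features(obs, horizon=4):
    """Stand-in for a fine-tuned video diffusion model: returns a representation
    mixing static info (the observed frames) with a finite-difference forecast
    of future states over `horizon` steps."""
    velocity = [b - a for a, b in zip(obs[:-1], obs[1:])]
    mean_v = sum(velocity) / len(velocity)
    return obs + [obs[-1] + mean_v * t for t in range(1, horizon + 1)]

def inverse_dynamics_policy(features, weights):
    """Implicit inverse dynamics: a linear map from predictive features to a
    single scalar action (in VPP this would be a learned network)."""
    return sum(w * f for w, f in zip(weights, features))

# Toy rollout: a 1-D point moving at constant velocity.
obs_history = [0.0, 0.5, 1.0]         # past "frames" as scalar states
feats = video_model_features(obs_history)
weights = [0.1] * len(feats)          # would be learned from demonstrations
action = inverse_dynamics_policy(feats, weights)
print(len(feats), round(action, 2))   # 3 observed + 4 predicted features
```

The key design choice mirrored here is that the policy conditions on *predicted* future representations rather than only the current frame, so task-relevant motion information is available to the action decoder.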
πŸ”Ž Similar Papers
No similar papers found.
Authors
Yucheng Hu — IIIS, Tsinghua University; Shanghai Qizhi Institute; RobotEra
Yanjiang Guo — Tsinghua University (Embodied AI, Generative Models)
Pengchao Wang — RobotEra
Xiaoyu Chen — IIIS, Tsinghua University; Shanghai Qizhi Institute
Yen-Jen Wang — UC Berkeley (Robotics)
Jianke Zhang — Tsinghua University, IIIS (Embodied AI, VLM, Multimodal Learning)
K. Sreenath — University of California, Berkeley
Chaochao Lu — Shanghai AI Laboratory (Causal AI)
Jianyu Chen — Assistant Professor, Tsinghua University (AI, Robotics)