🤖 AI Summary
Existing vision encoders predominantly rely on single-image reconstruction or pairwise image contrastive learning, emphasizing static representations while struggling to capture task-critical dynamic temporal information. To address this, we propose leveraging video diffusion models (VDMs) to generate discriminative visual representations that jointly encode static scene features and predictive motion dynamics, and, for the first time, directly employ such representations for robotic policy learning. Methodologically, we introduce a conditional implicit inverse dynamics modeling framework in which VDM-derived representations serve as the input that drives embodied action decisions; we further enhance generalization via multi-source fine-tuning on both robot manipulation data and Internet-sourced human hand manipulation videos. On the CALVIN ABC-D generalization benchmark, our approach achieves an 18.6% relative improvement; in real-world dexterous manipulation tasks, success rates increase by 31.6%. These results demonstrate the efficacy and strong generalization capability of predictive visual representations for universal robotic policy learning.
📄 Abstract
Visual representations play a crucial role in developing generalist robotic policies. Previous vision encoders, typically pre-trained with single-image reconstruction or two-image contrastive learning, tend to capture static information, often neglecting the dynamic aspects vital for embodied tasks. Recently, video diffusion models (VDMs) have demonstrated the ability to predict future frames and showcase a strong understanding of the physical world. We hypothesize that VDMs inherently produce visual representations that encompass both current static information and predicted future dynamics, thereby providing valuable guidance for robot action learning. Based on this hypothesis, we propose the Video Prediction Policy (VPP), which learns an implicit inverse dynamics model conditioned on the predicted future representations inside VDMs. To predict the future more precisely, we fine-tune the pre-trained video foundation model on robot datasets along with Internet human manipulation data. In experiments, VPP achieves an 18.6% relative improvement on the CALVIN ABC-D generalization benchmark compared to the previous state-of-the-art, and demonstrates a 31.6% increase in success rates on complex real-world dexterous manipulation tasks. Project page: https://video-prediction-policy.github.io
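The core idea of the abstract can be sketched in a few lines: the VDM's intermediate features from a single forward (denoising) pass serve as a predictive representation, and a lightweight action head conditioned on them acts as an implicit inverse dynamics model. The sketch below is a minimal NumPy illustration, not the authors' implementation; all function names, shapes, and weights are hypothetical stand-ins (real VPP uses a fine-tuned video foundation model and a learned policy head).

```python
import numpy as np

rng = np.random.default_rng(0)

def vdm_features(obs_frames, w_enc):
    """Stand-in for one VDM denoising pass: the intermediate activations are
    treated as a representation encoding predicted future dynamics."""
    return np.tanh(obs_frames @ w_enc)  # (horizon, d_feat)

def implicit_inverse_dynamics(features, w_act):
    """Hypothetical action head: pool the predictive features over the
    horizon and map them to a continuous action chunk."""
    pooled = features.mean(axis=0)       # aggregate over predicted frames
    return np.tanh(pooled @ w_act)       # (action_dim,) in [-1, 1]

# Illustrative shapes: 8 predicted frames of 32-dim patch features,
# a 64-dim latent, and a 7-DoF arm action.
obs_frames = rng.normal(size=(8, 32))
w_enc = 0.1 * rng.normal(size=(32, 64))
w_act = 0.1 * rng.normal(size=(64, 7))

features = vdm_features(obs_frames, w_enc)
action = implicit_inverse_dynamics(features, w_act)
print(action.shape)  # (7,)
```

The key design choice this mirrors is that the policy never decodes pixels: it consumes the VDM's internal predictive representation directly, which is far cheaper than generating full future videos at control time.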