🤖 AI Summary
Existing vision encoders predominantly rely on single-image reconstruction or pairwise image contrastive learning, emphasizing static representations while struggling to capture task-critical dynamic temporal information. To address this, we propose leveraging video diffusion models (VDMs) to generate discriminative visual representations that jointly encode static scene features and predictive motion dynamics, and, for the first time, directly employ such representations for robotic policy learning. Methodologically, we introduce a conditional implicit inverse dynamics modeling framework in which VDM-derived representations serve as the input that drives embodied action decisions; we further enhance generalization via multi-source fine-tuning on both robot manipulation data and Internet-sourced human hand manipulation videos. On the CALVIN ABC-D generalization benchmark, our approach achieves an 18.6% relative improvement; in real-world dexterous manipulation tasks, success rates increase by 31.6%. These results demonstrate the efficacy and strong generalization capability of predictive visual representations for universal robotic policy learning.
📄 Abstract
Visual representations play a crucial role in developing generalist robotic policies. Previous vision encoders, typically pre-trained with single-image reconstruction or two-image contrastive learning, tend to capture static information, often neglecting the dynamic aspects vital for embodied tasks. Recently, video diffusion models (VDMs) have demonstrated the ability to predict future frames and showcase a strong understanding of the physical world. We hypothesize that VDMs inherently produce visual representations that encompass both current static information and predicted future dynamics, thereby providing valuable guidance for robot action learning. Based on this hypothesis, we propose the Video Prediction Policy (VPP), which learns an implicit inverse dynamics model conditioned on the predicted future representations inside VDMs. To predict the future more precisely, we fine-tune the pre-trained video foundation model on robot datasets along with Internet human manipulation data. In experiments, VPP achieves an 18.6% relative improvement on the CALVIN ABC-D generalization benchmark compared to the previous state-of-the-art, and demonstrates a 31.6% increase in success rates on complex real-world dexterous manipulation tasks. Project page: https://video-prediction-policy.github.io
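The core idea of the abstract can be sketched in a few lines: the VDM's intermediate features from a single forward (denoising) pass serve as a predictive representation, and a lightweight action head conditioned on them acts as an implicit inverse dynamics model. The sketch below is a minimal NumPy illustration, not the authors' implementation; all function names, shapes, and weights are hypothetical stand-ins (real VPP uses a fine-tuned video foundation model and a learned policy head).

```python
import numpy as np

rng = np.random.default_rng(0)

def vdm_features(obs_frames, w_enc):
    """Stand-in for one VDM denoising pass: the intermediate activations are
    treated as a representation encoding predicted future dynamics."""
    return np.tanh(obs_frames @ w_enc)  # (horizon, d_feat)

def implicit_inverse_dynamics(features, w_act):
    """Hypothetical action head: pool the predictive features over the
    horizon and map them to a continuous action chunk."""
    pooled = features.mean(axis=0)       # aggregate over predicted frames
    return np.tanh(pooled @ w_act)       # (action_dim,) in [-1, 1]

# Illustrative shapes: 8 predicted frames of 32-dim patch features,
# a 64-dim latent, and a 7-DoF arm action.
obs_frames = rng.normal(size=(8, 32))
w_enc = 0.1 * rng.normal(size=(32, 64))
w_act = 0.1 * rng.normal(size=(64, 7))

features = vdm_features(obs_frames, w_enc)
action = implicit_inverse_dynamics(features, w_act)
print(action.shape)  # (7,)
```

The key design choice this mirrors is that the policy never decodes pixels: it consumes the VDM's internal predictive representation directly, which is far cheaper than generating full future videos at control time.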