Video Generators are Robot Policies

📅 2025-08-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current visuomotor policy learning faces two key bottlenecks: poor generalization under perception/action distribution shifts and heavy reliance on large-scale human demonstration data. To address these, we propose the first framework that integrates large-scale video generation models into robot policy learning, using video generation as a proxy task. Our approach introduces an end-to-end, modular architecture that jointly models video and action sequences. Crucially, it requires only a small number of action-annotated demonstrations while leveraging vast amounts of unlabeled video data to enhance representation robustness. This significantly improves generalization across objects, backgrounds, and tasks, as well as sample efficiency. Evaluated on both simulated and real-world robotic platforms, our method consistently outperforms behavioral cloning baselines, demonstrating strong cross-domain transferability and practical deployability.

📝 Abstract
Despite tremendous progress in dexterous manipulation, current visuomotor policies remain fundamentally limited by two challenges: they struggle to generalize under perceptual or behavioral distribution shifts, and their performance is constrained by the size of human demonstration data. In this paper, we use video generation as a proxy for robot policy learning to address both limitations simultaneously. We propose Video Policy, a modular framework that combines video and action generation and can be trained end-to-end. Our results demonstrate that learning to generate videos of robot behavior allows policies to be extracted from minimal demonstration data, significantly improving robustness and sample efficiency. Our method shows strong generalization to unseen objects, backgrounds, and tasks, both in simulation and in the real world. We further highlight that task success is closely tied to the generated video, with action-free video data providing critical benefits for generalizing to novel tasks. By leveraging large-scale video generative models, we achieve superior performance compared to traditional behavior cloning, paving the way for more scalable and data-efficient robot policy learning.
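The abstract's core training recipe (every clip supervises video prediction, while only the small action-annotated subset also supervises action prediction) can be sketched as a simple joint objective. This is a minimal, hypothetical illustration: the function names (`predict_video`, `predict_actions`), the mean-squared-error losses, and the `action_weight` term are all assumptions for the sketch, not the paper's actual implementation.

```python
# Toy sketch of a joint video + action objective: unlabeled clips contribute
# only a video-prediction loss; action-annotated demonstrations additionally
# contribute an action loss. "Frames" and "actions" are 1-D stand-ins here.

def mse(pred, target):
    # mean squared error over two equal-length sequences of floats
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def joint_loss(batch, predict_video, predict_actions, action_weight=1.0):
    """batch: list of dicts with 'obs', 'future_frames', and optionally 'actions'."""
    total = 0.0
    for clip in batch:
        # video-prediction loss applies to every clip, labeled or not
        loss = mse(predict_video(clip["obs"]), clip["future_frames"])
        if "actions" in clip:  # only the action-annotated demonstrations
            loss += action_weight * mse(predict_actions(clip["obs"]), clip["actions"])
        total += loss
    return total / len(batch)
```

Under this framing, scaling the unlabeled video pool improves the shared representation (the video term) without requiring any additional action labels, which is the data-efficiency argument the abstract makes.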
Problem

Research questions and friction points this paper is trying to address.

Generalize under perceptual or behavioral shifts
Overcome human demonstration data size constraints
Improve robustness and sample efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Video generation as robot policy learning
Modular video and action generation framework
Leveraging large-scale video generative models