Large Video Planner Enables Generalizable Robot Control

📅 2025-12-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the limited generalization capability and task-specific training dependency of existing robotic decision-making models, this paper introduces the first large-scale, open-source video foundation model for robotic planning. Methodologically, it departs from conventional vision-language-action joint modeling paradigms and proposes a novel large-scale video pretraining framework grounded in internet-scale human activity videos. The framework integrates a video diffusion model with a spatiotemporal Transformer architecture to enable end-to-end, video-level spatiotemporal plan generation. A dedicated video-to-action decoder is further designed to support zero-shot mapping to real-robot execution. Extensive zero-shot deployment on third-party out-of-distribution tasks and physical robot platforms demonstrates strong instruction-following capability, cross-task and cross-environment generalization, and real-world feasibility. The model and associated dataset are released publicly.

📝 Abstract

General-purpose robots require decision-making models that generalize across diverse tasks and environments. Recent works build robot foundation models by extending multimodal large language models (MLLMs) with action outputs, creating vision-language-action (VLA) systems. These efforts are motivated by the intuition that MLLMs' large-scale language and image pretraining can be effectively transferred to the action output modality. In this work, we explore an alternative paradigm of using large-scale video pretraining as a primary modality for building robot foundation models. Unlike static images and language, videos capture spatio-temporal sequences of states and actions in the physical world that are naturally aligned with robotic behavior. We curate an internet-scale video dataset of human activities and task demonstrations, and train, for the first time at a foundation-model scale, an open video model for generative robotics planning. The model produces zero-shot video plans for novel scenes and tasks, which we post-process to extract executable robot actions. We evaluate task-level generalization through third-party selected tasks in the wild and real-robot experiments, demonstrating successful physical execution. Together, these results show robust instruction following, strong generalization, and real-world feasibility. We release both the model and dataset to support open, reproducible video-based robot learning. Our website is available at https://www.boyuan.space/large-video-planner/.
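The plan-then-act flow described in the abstract (generate a video plan, then post-process it into executable actions) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the names `VideoPlan`, `toy_video_planner`, and `extract_actions` are hypothetical, the "planner" is a linear interpolation standing in for a real video model, and each frame is reduced to a single tracked 3D position. It is not the paper's model, its action-extraction method, or its released API.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical stand-in: one tracked 3D position per generated frame.
Pose = Tuple[float, ...]


@dataclass
class VideoPlan:
    instruction: str
    poses: List[Pose]  # one tracked position per "frame"


def toy_video_planner(instruction: str, start: Pose, goal: Pose,
                      num_frames: int = 5) -> VideoPlan:
    """Placeholder for a video model: linearly interpolates start -> goal.

    A real video model would generate image frames conditioned on the
    scene and the instruction; this stub only keeps the example runnable.
    """
    if num_frames < 2:
        raise ValueError("need at least two frames")
    poses = []
    for i in range(num_frames):
        t = i / (num_frames - 1)
        poses.append(tuple(s + t * (g - s) for s, g in zip(start, goal)))
    return VideoPlan(instruction, poses)


def extract_actions(plan: VideoPlan) -> List[Pose]:
    """Post-process a plan into per-step displacements (frame differences)."""
    return [
        tuple(b - a for a, b in zip(prev, nxt))
        for prev, nxt in zip(plan.poses, plan.poses[1:])
    ]


plan = toy_video_planner("pick up the cup",
                         start=(0.0, 0.0, 0.0), goal=(0.4, 0.0, 0.2))
actions = extract_actions(plan)
for step, delta in enumerate(actions):
    print(step, delta)
```

The point of the split is that the planner never emits robot commands directly: a separate post-processing step converts the plan into actions, so in principle the same plan could be mapped onto different robot embodiments.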

Problem

Research questions and friction points this paper is trying to address.

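Whether large-scale video pretraining can serve as the primary modality for robot foundation models, as an alternative to extending MLLMs with action outputs

How to produce zero-shot plans that generalize to novel tasks and environments

How to turn generated video plans into actions a physical robot can execute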

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses large-scale video pretraining for robot foundation models

Generates zero-shot video plans for novel tasks and scenes

Extracts executable robot actions from video plans
