Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

📅 2025-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Imitation learning suffers from heavy reliance on action annotations and limited generalization. To address this, we propose the Unified World Model (UWM), a shared multimodal transformer that couples a video diffusion process and an action diffusion process, with an independent diffusion timestep for each modality, so a single network can serve as a policy, a forward or inverse dynamics model, and a video generator. Because the timesteps are controlled per modality, UWM can be pretrained end-to-end on large-scale action-free video data, unifying imitation learning with world modeling. After pretraining on multi-task robotic datasets, finetuned policies significantly outperform pure imitation learning baselines in both simulation and real-robot tasks, showing stronger cross-task generalization and robustness to environmental variation.
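As a rough illustration of the architecture described above, the PyTorch-style sketch below shows one way a single transformer trunk can denoise video and action tokens under separate diffusion timesteps. All names (`UWMTransformer`, `t_video`, `t_action`) and dimensions are assumptions for exposition, not the authors' released code, and conditioning on the current observation is omitted for brevity.

```python
import torch
import torch.nn as nn

class UWMTransformer(nn.Module):
    """Hypothetical sketch of a unified video/action denoiser.

    Video and action tokens share one transformer trunk, but each modality
    carries its own diffusion timestep embedding, so the two diffusion
    processes can be noised and denoised independently.
    """

    def __init__(self, dim=512, depth=8, heads=8, video_dim=1024, action_dim=7):
        super().__init__()
        self.video_in = nn.Linear(video_dim, dim)
        self.action_in = nn.Linear(action_dim, dim)
        self.t_embed = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, depth)
        self.video_out = nn.Linear(dim, video_dim)
        self.action_out = nn.Linear(dim, action_dim)

    def forward(self, noisy_video, noisy_action, t_video, t_action):
        # Each modality is conditioned on its *own* diffusion timestep.
        v = self.video_in(noisy_video) + self.t_embed(t_video[:, None].float())[:, None, :]
        a = self.action_in(noisy_action) + self.t_embed(t_action[:, None].float())[:, None, :]
        # Joint attention over both token streams in a shared trunk.
        h = self.trunk(torch.cat([v, a], dim=1))
        n_video = noisy_video.shape[1]
        return self.video_out(h[:, :n_video]), self.action_out(h[:, n_video:])
```

Here `noisy_video` and `noisy_action` are token sequences of shape (batch, tokens, features), and `t_video`, `t_action` are per-sample integer timesteps.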

📝 Abstract
Imitation learning has emerged as a promising approach towards building generalist robots. However, scaling imitation learning for large robot foundation models remains challenging due to its reliance on high-quality expert demonstrations. Meanwhile, large amounts of video data depicting a wide range of environments and diverse behaviors are readily available. This data provides a rich source of information about real-world dynamics and agent-environment interactions. Leveraging this data directly for imitation learning, however, has proven difficult due to the lack of action annotation required for most contemporary methods. In this work, we present Unified World Models (UWM), a framework that allows for leveraging both video and action data for policy learning. Specifically, a UWM integrates an action diffusion process and a video diffusion process within a unified transformer architecture, where independent diffusion timesteps govern each modality. We show that by simply controlling each diffusion timestep, UWM can flexibly represent a policy, a forward dynamics model, an inverse dynamics model, and a video generator. Through simulated and real-world experiments, we show that: (1) UWM enables effective pretraining on large-scale multitask robot datasets with both dynamics and action predictions, resulting in more generalizable and robust policies than imitation learning, (2) UWM naturally facilitates learning from action-free video data through independent control of modality-specific diffusion timesteps, further improving the performance of finetuned policies. Our results suggest that UWM offers a promising step toward harnessing large, heterogeneous datasets for scalable robot learning, and provides a simple unification between the often disparate paradigms of imitation learning and world modeling. Videos and code are available at https://weirdlabuw.github.io/uwm/.
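The abstract's claim that one model can act as a policy, forward dynamics model, inverse dynamics model, or video generator follows from fixing one modality's timestep while the other is iteratively denoised. The mapping below is one plausible reading of that scheme; the exact timestep conventions and names are assumptions based on the abstract, not the authors' released implementation.

```python
# Conceptual sketch: fixing one modality's diffusion timestep while the other
# is iteratively denoised selects what the shared model computes. T is the
# final (fully noised) timestep; 0 means the modality is provided clean.

T = 1000  # assumed number of diffusion steps

def timesteps_for(mode):
    """Return (t_future_video, t_action); None marks the modality being denoised."""
    if mode == "policy":            # p(action | obs): future video marginalized
        return T, None              # video held at pure noise, denoise actions
    if mode == "forward_dynamics":  # p(future video | obs, action)
        return None, 0              # actions given clean, denoise video
    if mode == "inverse_dynamics":  # p(action | obs, future video)
        return 0, None              # future video given clean, denoise actions
    if mode == "video_generation":  # p(future video | obs): actions marginalized
        return None, T              # actions held at pure noise, denoise video
    raise ValueError(f"unknown mode: {mode}")
```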
Problem

Research questions and friction points this paper is trying to address.

Leveraging video and action data for scalable robot policy learning.
Integrating action and video diffusion in a unified transformer architecture.
Improving imitation learning by utilizing action-free video datasets.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified transformer for video and action diffusion
Independent diffusion timesteps per modality, enabling pretraining on action-free video (see the sketch after this list)
Flexible policy and dynamics representation
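The sketch below shows one way the independent action timestep could permit pretraining on action-free clips: for samples without action labels, the action timestep is pinned at the final (pure-noise) step and the action loss is masked out, so only the video branch is supervised. The noise schedule, the loss masking, and all names are assumptions; `model` is assumed to follow the two-timestep forward signature sketched earlier.

```python
import torch
import torch.nn.functional as F

T = 1000  # assumed number of diffusion steps
# Assumed linear-beta DDPM noise schedule; alpha_bar[t] is the cumulative product.
betas = torch.linspace(1e-4, 2e-2, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def add_noise(x, t):
    """DDPM forward process: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = torch.randn_like(x)
    ab = alpha_bar[t].view(-1, *([1] * (x.dim() - 1)))
    return ab.sqrt() * x + (1.0 - ab).sqrt() * eps, eps

def pretraining_step(model, future_video, action, has_action):
    """One joint denoising step; has_action is a bool tensor marking labeled samples."""
    B = future_video.shape[0]
    t_video = torch.randint(0, T, (B,))
    t_action = torch.randint(0, T, (B,))
    # Action-free clips: pin the action timestep at the final step, so the
    # action tokens the model sees are effectively pure noise.
    t_action = torch.where(has_action, t_action, torch.full_like(t_action, T - 1))

    noisy_video, eps_v = add_noise(future_video, t_video)
    noisy_action, eps_a = add_noise(action, t_action)

    pred_v, pred_a = model(noisy_video, noisy_action, t_video, t_action)
    loss_video = F.mse_loss(pred_v, eps_v)
    # Mask the action loss for action-free samples; only the video branch
    # provides supervision there.
    per_sample = F.mse_loss(pred_a, eps_a, reduction="none").mean(dim=(1, 2))
    loss_action = (per_sample * has_action.float()).mean()
    return loss_video + loss_action
```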
Chuning Zhu
University of Washington
Reinforcement Learning, Robotics
Raymond Yu
University of Washington
Robot Learning, Computer Vision, Reinforcement Learning
Siyuan Feng
Toyota Research Institute
B. Burchfiel
Toyota Research Institute
Paarth Shah
Toyota Research Institute
Abhishek Gupta
Paul G. Allen School of Computer Science and Engineering, University of Washington