DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos

📅 2026-02-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two obstacles facing general-purpose robots in open, contact-rich environments: scarce action labels and limited data coverage. The authors propose DreamDojo, a world model built on continuous latent actions that is pretrained on 44,000 hours of unlabeled egocentric human video and then post-trained on a small robot dataset to enable accurate, controllable physical simulation. The authors describe the pretraining data mixture as the largest video dataset used for world model pretraining to date. A distillation pipeline brings inference to a real-time 10.81 FPS, and the model shows strong physical understanding and action controllability on multiple out-of-distribution benchmarks. The resulting model is applicable to live teleoperation, policy evaluation, and model-based planning.

📝 Abstract

Being able to simulate the outcomes of actions in varied environments will revolutionize the development of generalist agents at scale. However, modeling these world dynamics, especially for dexterous robotics tasks, poses significant challenges due to limited data coverage and scarce action labels. As an endeavor towards this end, we introduce DreamDojo, a foundation world model that learns diverse interactions and dexterous controls from 44k hours of egocentric human videos. Our data mixture represents the largest video dataset to date for world model pretraining, spanning a wide range of daily scenarios with diverse objects and skills. To address the scarcity of action labels, we introduce continuous latent actions as unified proxy actions, enhancing interaction knowledge transfer from unlabeled videos. After post-training on small-scale target robot data, DreamDojo demonstrates a strong understanding of physics and precise action controllability. We also devise a distillation pipeline that accelerates DreamDojo to a real-time speed of 10.81 FPS and further improves context consistency. Our work enables several important applications based on generative world models, including live teleoperation, policy evaluation, and model-based planning. Systematic evaluation on multiple challenging out-of-distribution (OOD) benchmarks verifies the significance of our method for simulating open-world, contact-rich tasks, paving the way for general-purpose robot world models.
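
To put the two headline figures in the abstract into more familiar units, the short sketch below converts the reported 10.81 FPS into a per-frame time budget and the 44k hours of video into years of continuous footage. It is only unit arithmetic on numbers quoted above; the function names are hypothetical and nothing here reflects the authors' code.

```python
# Unit conversions for the figures quoted in the abstract.
# Function names are illustrative assumptions, not part of DreamDojo.

def frame_budget_ms(fps: float) -> float:
    """Average time available per generated frame, in milliseconds."""
    return 1000.0 / fps


def hours_to_years(hours: float) -> float:
    """Length of continuous footage in years (365-day years)."""
    return hours / (24.0 * 365.0)


if __name__ == "__main__":
    # 10.81 FPS leaves roughly 92.5 ms per frame.
    print(f"{frame_budget_ms(10.81):.1f} ms per frame")
    # 44,000 hours is roughly 5 years of continuous video.
    print(f"{hours_to_years(44_000):.2f} years of video")
```

For live teleoperation, a budget of roughly 92.5 ms per frame is the average latency implied by the reported frame rate; the abstract does not say how that time is split across model components.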

Problem

Research questions and friction points this paper is trying to address.

world model

dexterous robotics

action labels

human videos

generalist agents

Innovation

Methods, ideas, or system contributions that make the work stand out.

world model

latent actions

human video pretraining