VideoWorld 2: Learning Transferable Knowledge from Real-world Videos

📅 2026-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of learning transferable dynamics from unlabeled real-world videos and applying them effectively in novel environments. To this end, the authors propose a dynamic-enhanced Latent Dynamics Model (dLDM) that disentangles action-conditioned dynamics from visual appearance: a pretrained video diffusion model handles appearance modeling, freeing the dLDM to learn compact, task-relevant dynamic representations. Coupled with autoregressive modeling of the latent codes, the approach supports long-horizon inference and policy learning. The method achieves the first successful direct learning of transferable world models from raw real-world videos, demonstrating up to a 70% improvement in task success rate on complex manipulation tasks, generating temporally coherent long-horizon action sequences, and substantially outperforming existing approaches on the CALVIN robotics benchmark.

📝 Abstract
Learning transferable knowledge from unlabeled video data and applying it in new environments is a fundamental capability of intelligent agents. This work presents VideoWorld 2, which extends VideoWorld and offers the first investigation into learning transferable knowledge directly from raw real-world videos. At its core, VideoWorld 2 introduces a dynamic-enhanced Latent Dynamics Model (dLDM) that decouples action dynamics from visual appearance: a pretrained video diffusion model handles visual appearance modeling, enabling the dLDM to learn latent codes that focus on compact and meaningful task-related dynamics. These latent codes are then modeled autoregressively to learn task policies and support long-horizon reasoning. We evaluate VideoWorld 2 on challenging real-world handcraft-making tasks, where prior video generation and latent-dynamics models struggle to operate reliably. Remarkably, VideoWorld 2 achieves up to a 70% improvement in task success rate and produces coherent long-horizon execution videos. In robotics, we show that VideoWorld 2 can acquire effective manipulation knowledge from the Open-X dataset, which substantially improves task performance on CALVIN. This study reveals the potential of learning transferable world knowledge directly from raw videos, with all code, data, and models to be open-sourced for further research.
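The two-stage idea in the abstract (compress inter-frame change into compact latent dynamics codes while a separate model handles appearance, then roll those codes out autoregressively for long-horizon reasoning) can be sketched in miniature. Everything below is a hypothetical toy stand-in, not the paper's implementation: `encode_dynamics`, `rollout`, the tiny codebook, and the repeat-last-code policy are all illustrative assumptions; the real dLDM uses a learned encoder and a pretrained video diffusion decoder, both omitted here.

```python
import numpy as np

# Toy codebook of discrete latent dynamics codes (hypothetical; the real
# model learns its latent space rather than using a fixed codebook).
CODEBOOK = np.array([-1.0, 0.0, 1.0])

def encode_dynamics(frame_t, frame_t1):
    """Compress the change between two frames into one discrete code index.

    Stands in for the dLDM encoder, which keeps only compact, task-relevant
    dynamics; visual appearance is left to the (omitted) diffusion decoder.
    """
    delta = float(np.mean(frame_t1 - frame_t))       # crude motion summary
    return int(np.argmin(np.abs(CODEBOOK - delta)))  # nearest-code quantization

def rollout(policy, code_history, horizon):
    """Autoregressive latent-code rollout for long-horizon inference.

    `policy` maps the code history to the next code index, mirroring how
    dLDM latent codes are modeled autoregressively.
    """
    codes = list(code_history)
    for _ in range(horizon):
        codes.append(policy(codes))
    return codes

# Toy policy: repeat the most recent code (a learned autoregressive
# model would sit here in a real system).
policy = lambda codes: codes[-1]

frames = [np.zeros((4, 4)), np.ones((4, 4))]         # uniformly brightening "video"
first_code = encode_dynamics(frames[0], frames[1])   # nearest code to +1.0 is index 2
trajectory = rollout(policy, [first_code], horizon=3)
print(trajectory)  # [2, 2, 2, 2]
```

The point of the separation is visible even in this toy: the code sequence is a few integers per step, so long-horizon reasoning operates over a far smaller space than raw pixels.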
Problem

Research questions and friction points this paper is trying to address.

transferable knowledge
real-world videos
latent dynamics
video understanding
robotic manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent Dynamics Model
video diffusion model
transferable knowledge
real-world videos
autoregressive modeling