WoW: Towards a World omniscient World model Through Embodied Interaction

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current video models (e.g., Sora) rely on passive observation and struggle to model physical causality, whereas humans develop intuitive physics understanding through active interaction. To address this, we propose an embodied learning paradigm driven by large-scale real-world robot interactions—2 million diverse trajectories—and introduce WoW, a 14B-parameter generative world model. Methodologically, WoW employs a DiT architecture to synthesize physically consistent videos and integrates SOPHIA: a vision-language-model-based mechanism that dynamically constrains and refines generation via verifiable physical reasoning. Additionally, an inverse dynamics model is co-trained to close the loop among “imagination,” “planning,” and “action.” Evaluated on our newly constructed WoWBench benchmark, WoW achieves substantial gains in collision dynamics modeling, object permanence, and causal reasoning—marking the first successful large-scale, interaction-driven instantiation of intuitive physical reasoning.
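The generate-critique-refine cycle described above can be sketched in a few lines. This is an illustrative stub, not the paper's implementation: `generate_video`, `critique`, and `refine_instruction` are hypothetical stand-ins for the DiT generator, the SOPHIA vision-language critics, and the language-level plan rewrite, respectively.

```python
# Hedged sketch of a SOPHIA-style refinement loop. Every function here is
# an illustrative stub, not the paper's actual API.

def generate_video(instruction: str) -> str:
    """Stub DiT generator: returns a 'video' keyed by its instruction."""
    return f"video({instruction})"

def critique(video: str) -> tuple[float, str]:
    """Stub VLM critic: score physical plausibility and give feedback.
    In this toy version, more refined (longer) instructions score higher."""
    score = min(1.0, len(video) / 40.0)
    return score, "add contact and gravity constraints"

def refine_instruction(instruction: str, feedback: str) -> str:
    """Fold the critic's feedback back into the language instruction."""
    return f"{instruction}; {feedback}"

def sophia_loop(instruction: str, threshold: float = 0.9, max_iters: int = 5) -> str:
    """Generate, critique, and refine until the output is plausible enough."""
    for _ in range(max_iters):
        video = generate_video(instruction)
        score, feedback = critique(video)
        if score >= threshold:
            break
        instruction = refine_instruction(instruction, feedback)
    return video
```

The key design point is that refinement happens in language space: the critics never touch the generator's weights, they only evolve the instruction that conditions the next generation.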

📝 Abstract
Humans develop an understanding of intuitive physics through active interaction with the world. This approach is in stark contrast to current video models, such as Sora, which rely on passive observation and therefore struggle with grasping physical causality. This observation leads to our central hypothesis: authentic physical intuition of the world model must be grounded in extensive, causally rich interactions with the real world. To test this hypothesis, we present WoW, a 14-billion-parameter generative world model trained on 2 million robot interaction trajectories. Our findings reveal that the model's understanding of physics is a probabilistic distribution of plausible outcomes, leading to stochastic instabilities and physical hallucinations. Furthermore, we demonstrate that this emergent capability can be actively constrained toward physical realism by SOPHIA, where vision-language model agents evaluate the DiT-generated output and guide its refinement by iteratively evolving the language instructions. In addition, a co-trained Inverse Dynamics Model translates these refined plans into executable robotic actions, thus closing the imagination-to-action loop. We establish WoWBench, a new benchmark focused on physical consistency and causal reasoning in video, where WoW achieves state-of-the-art performance in both human and autonomous evaluation, demonstrating strong ability in physical causality, collision dynamics, and object permanence. Our work provides systematic evidence that large-scale, real-world interaction is a cornerstone for developing physical intuition in AI. Models, data, and benchmarks will be open-sourced.
Problem

Research questions and friction points this paper is trying to address.

Developing AI world models through embodied robot interactions
Addressing physical causality gaps in passive video models
Creating physically consistent video generation with causal reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training world model with robot interaction trajectories
Using vision-language agents to refine generated outputs
Closing imagination-action loop via inverse dynamics model
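The third contribution, closing the imagination-action loop, can be sketched as follows. All names here are hypothetical stubs, assuming only the structure described above: the world model "imagines" a frame sequence, and a co-trained inverse dynamics model maps each pair of consecutive frames to the action connecting them.

```python
# Illustrative sketch (not the paper's implementation) of an inverse
# dynamics model (IDM) turning an imagined rollout into executable actions.

def imagine_frames(goal: str, n: int = 4) -> list[str]:
    """Stub world model: an imagined n-frame rollout toward the goal."""
    return [f"{goal}@t{i}" for i in range(n)]

def inverse_dynamics(frame_a: str, frame_b: str) -> str:
    """Stub IDM: recover the action linking two consecutive frames."""
    return f"action({frame_a} -> {frame_b})"

def plan_to_actions(goal: str) -> list[str]:
    """Imagine a rollout, then extract one action per frame transition."""
    frames = imagine_frames(goal)
    return [inverse_dynamics(a, b) for a, b in zip(frames, frames[1:])]
```

An n-frame imagined rollout yields n-1 actions, one per transition, which is what lets a purely generative model drive a real robot.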
🔎 Similar Papers
2024-07-09 · IEEE/ASME Transactions on Mechatronics · Citations: 94

👥 Authors
Xiaowei Chi (The Hong Kong University of Science and Technology)
Peidong Jia (Beijing Innovation Center of Humanoid Robotics; State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Chun-Kai Fan (Peking University)
Xiaozhu Ju (Beijing Innovation Center of Humanoid Robotics; State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Weishi Mi (Beijing Innovation Center of Humanoid Robotics; State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Kevin Zhang (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Zhiyuan Qin (Beijing Innovation Center of Humanoid Robotics; State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Wanxin Tian (Beijing Innovation Center of Humanoid Robotics; State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Kuangzhi Ge (Peking University)
Hao Li (Beijing Innovation Center of Humanoid Robotics; State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Zezhong Qian (Xi'an Jiaotong University)
Anthony Chen (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Qiang Zhou (Beijing Innovation Center of Humanoid Robotics; State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Yueru Jia (School of Computer Science, Peking University)
Jiaming Liu (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Yong Dai (Beijing Innovation Center of Humanoid Robotics; State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Qingpo Wuwu (Imperial College London | Peking University)
Chengyu Bai (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Yu-Kai Wang (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Ying Li (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Lizhang Chen (University of Texas at Austin)
Yong Bao (Beijing Innovation Center of Humanoid Robotics; State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Zhiyuan Jiang (Beijing Innovation Center of Humanoid Robotics; State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Jiacheng Zhu (MIT)
Kai Tang (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)