Wow, wo, val! A Comprehensive Embodied World Model Evaluation Turing Test

📅 2026-01-07
🏛️ arXiv.org
📈 Citations: 1
Influential: 1
🤖 AI Summary
This work proposes the first comprehensive evaluation framework for world models in embodied intelligence, introducing Wow-wo-val, a benchmark built on 609 robot manipulation trajectories. The framework assesses video foundation models as world models across five dimensions (perception, planning, prediction, generalization, and execution), using 22 quantitative metrics together with human preference evaluations to measure generative fidelity and robustness; the aggregated overall score correlates with human preference at a Pearson correlation above 0.93. A novel inverse dynamics Turing test reveals significant gaps between current models and human expectations, particularly in long-horizon planning (a score of 17.27), physical consistency (peaking at 68.02), and real-world task execution success rate (near zero for most models, while WoW achieves 40.74%).
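
As a rough illustration of how an inverse dynamics Turing test of this kind could be wired up, the sketch below generates a video per task, lets an IDM recover an action sequence from it, and replays those actions on a robot to compute a success rate. All objects and method names here (world_model, idm, robot, task.check_success) are hypothetical stand-ins for illustration, not the paper's actual interfaces.

```python
# Hypothetical sketch of an inverse-dynamics-model (IDM) Turing test:
# a world model generates a video for each task, an IDM infers the action
# sequence from that video, and the actions are replayed on a robot.
# world_model, idm, robot, and tasks are assumed stand-in objects.

def idm_success_rate(world_model, idm, robot, tasks):
    successes = 0
    for task in tasks:
        video = world_model.generate(task.instruction, task.initial_frame)
        actions = idm.infer_actions(video)      # video frames -> action sequence
        robot.reset(task.initial_state)
        for action in actions:
            robot.step(action)
        successes += int(task.check_success(robot.state))
    return successes / len(tasks)

# The paper reports roughly 40.74% for WoW and near 0% for most other models.
```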

📝 Abstract
As world models gain momentum in Embodied AI, an increasing number of works explore using video foundation models as predictive world models for downstream embodied tasks such as 3D prediction or interactive generation. However, before these downstream tasks can be explored, video foundation models still leave two critical questions unanswered: (1) whether their generative generalization is sufficient to maintain perceptual fidelity in the eyes of human observers, and (2) whether they are robust enough to serve as a universal prior for real-world embodied agents. To provide a standardized framework for answering these questions, we introduce the Embodied Turing Test benchmark: WoW-World-Eval (Wow-wo-val). Built upon 609 robot manipulation trajectories, Wow-wo-val examines five core abilities: perception, planning, prediction, generalization, and execution. We propose a comprehensive evaluation protocol with 22 metrics to assess the models' generation ability; its overall score achieves a high Pearson correlation with human preference (>0.93), establishing a reliable foundation for the Human Turing Test. On Wow-wo-val, models achieve only 17.27 on long-horizon planning and at best 68.02 on physical consistency, indicating limited spatiotemporal consistency and physical reasoning. For the Inverse Dynamics Model Turing Test, we first use an IDM to evaluate the video foundation models' execution accuracy in the real world. However, most models collapse to ≈ 0% success, while WoW maintains a 40.74% success rate. These findings point to a noticeable gap between generated videos and the real world, highlighting the urgency and necessity of benchmarking world models in Embodied AI.
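
As a rough illustration of the protocol's human-alignment check, the sketch below correlates an aggregated overall score (here, an unweighted mean over 22 per-metric scores on synthetic data) with human preference ratings via a plain Pearson correlation. The array shapes, the averaging step, and the numbers are assumptions for illustration, not the paper's actual aggregation scheme.

```python
import numpy as np

# Hypothetical example: per-model scores on 22 metrics and matching
# human preference ratings (one value per model). Numbers are made up.
metric_scores = np.random.rand(8, 22)   # 8 models x 22 metrics
human_pref = np.random.rand(8)          # human preference per model

# Simple aggregation: unweighted mean over the 22 metrics
# (the paper's real protocol may weight or normalize differently).
overall_score = metric_scores.mean(axis=1)

# Pearson correlation between overall score and human preference;
# the paper reports a correlation above 0.93 for its protocol.
r = np.corrcoef(overall_score, human_pref)[0, 1]
print(f"Pearson r = {r:.3f}")
```
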
Problem

Research questions and friction points this paper is trying to address.

world models
embodied AI
video foundation models
perceptual fidelity
robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Embodied Turing Test
World Model Evaluation
Video Foundation Models
Inverse Dynamics Model
Spatiotemporal Consistency
Chun-Kai Fan
Peking University
Xiaowei Chi
The Hong Kong University of Science and Technology
Multimodal Generation · Robotics · Computer Vision
Xiaozhu Ju
Beijing Innovation Center of Humanoid Robotics
Hao Li
Beijing Innovation Center of Humanoid Robotics
Yong Bao
Beijing Innovation Center of Humanoid Robotics
Yu-Kai Wang
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Lizhang Chen
Ph.D. student, University of Texas at Austin
training efficiency
Zhiyuan Jiang
Beijing Innovation Center of Humanoid Robotics
Kuangzhi Ge
Peking University
Multimodal Learning · Embodied AI
Ying Li
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Weishi Mi
Beijing Innovation Center of Humanoid Robotics
Qingpo Wuwu
Imperial College London | Peking University
Neural Rendering · Physical Simulation · PDEs Solving
Peidong Jia
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Yulin Luo
Peking University
Data-centric AI · LLM · VLM · Embodied AI
Kevin Zhang
Peking University
ML
Zhiyuan Qin
Beijing Innovation Center of Humanoid Robotics
Yong Dai
Beijing Innovation Center of Humanoid Robotics
Sirui Han
The Hong Kong University of Science and Technology
Large Language Model · Interdisciplinary Artificial Intelligence
Yike Guo
The Hong Kong University of Science and Technology
Shanghang Zhang
Peking University
Embodied AI · Foundation Models
Jian Tang
Beijing Innovation Center of Humanoid Robotics