🤖 AI Summary
This work proposes the first comprehensive evaluation framework for world models in embodied intelligence, introducing WoW-World-Eval, a benchmark built on 609 robot manipulation trajectories. The framework assesses video foundation models as world models across five dimensions (perception, planning, prediction, generalization, and execution), using 22 quantitative metrics together with human preference evaluations (Pearson correlation with human preference above 0.93) to measure generative fidelity and robustness. A novel inverse dynamics Turing test reveals significant gaps between current models and real-world requirements, particularly in long-horizon planning (a score of 17.27), physical consistency (at best 68.02), and task execution success rate (near zero for most models, while WoW achieves 40.74%).
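The inverse dynamics Turing test described above amounts to a closed loop: generate a video from a task instruction, decode actions from that video with an inverse dynamics model (IDM), execute those actions, and count successes. The following is a minimal sketch of that loop, where `generate_video`, `idm_infer_actions`, and `execute_on_robot` are hypothetical stand-ins for the paper's actual components:

```python
# Hedged sketch of the inverse-dynamics evaluation loop; the three callables
# are hypothetical placeholders, not the paper's actual interfaces.

def execution_success_rate(tasks, generate_video, idm_infer_actions,
                           execute_on_robot):
    """Fraction of tasks where actions decoded from a generated video succeed."""
    successes = 0
    for task in tasks:
        video = generate_video(task.instruction, task.initial_frame)
        actions = idm_infer_actions(video)            # IDM: video -> actions
        successes += execute_on_robot(task, actions)  # returns 1 on success, 0 otherwise
    return successes / len(tasks)
```

Under this protocol, a model's score reflects not just visual quality but whether its generated rollouts encode physically executable behavior, which is why most models collapse to near-zero success.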
📝 Abstract
As world models gain momentum in Embodied AI, a growing number of works explore using video foundation models as predictive world models for downstream embodied tasks such as 3D prediction and interactive generation. Before exploring these downstream tasks, however, two critical questions about video foundation models remain unanswered: (1) whether their generative generalization is sufficient to maintain perceptual fidelity in the eyes of human observers, and (2) whether they are robust enough to serve as a universal prior for real-world embodied agents. To provide a standardized framework for answering these questions, we introduce the Embodied Turing Test benchmark WoW-World-Eval. Built on 609 robot manipulation trajectories, WoW-World-Eval examines five core abilities: perception, planning, prediction, generalization, and execution. We propose a comprehensive evaluation protocol with 22 metrics to assess the models' generation ability; its overall score achieves a high Pearson correlation with human preference (>0.93), establishing a reliable foundation for the Human Turing Test. On WoW-World-Eval, models score only 17.27 on long-horizon planning and at best 68.02 on physical consistency, indicating limited spatiotemporal consistency and physical reasoning. For the Inverse Dynamics Model (IDM) Turing Test, we use an IDM to evaluate the video foundation models' execution accuracy in the real world: most models collapse to ≈0% success, while WoW maintains a 40.74% success rate. These findings point to a noticeable gap between generated videos and the real world, highlighting the urgency and necessity of benchmarking world models in Embodied AI.
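The reported alignment between the benchmark's overall score and human preference is a standard Pearson correlation. A dependency-free sketch of that check, using hypothetical per-model overall scores and mean human preference ratings (the real values come from the benchmark's 22-metric protocol and human raters):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical illustrative data: one overall benchmark score and one mean
# human preference rating per evaluated model (not the paper's numbers).
overall_scores = [62.1, 55.4, 48.9, 40.2, 33.7]
human_ratings  = [4.5, 4.1, 3.6, 2.9, 2.4]
print(round(pearson(overall_scores, human_ratings), 3))
```

A correlation above 0.93, as the paper reports, means the automatic overall score ranks models almost exactly as human raters do, which is what justifies using it as a proxy in the Human Turing Test.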