WorldPrediction: A Benchmark for High-level World Modeling and Long-horizon Procedural Planning

📅 2025-06-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the critical deficiency of current AI models in high-level world modeling (WM) and long-horizon procedural planning (PP). To this end, the authors introduce the first video benchmark explicitly designed for semantically abstract actions, as distinct from existing benchmarks focused on low-level motion planning. Their method systematically evaluates models' capacity to understand and plan over temporally and semantically abstract actions; introduces an action-equivalence discrimination task with a background-agnostic design to mitigate the exploitation of spurious cues; and formalizes evaluation within a partially observable semi-Markov decision process (PO-SMDP) framework. Experiments reveal that state-of-the-art models achieve only 57% accuracy on WM and 38% on PP, substantially below human performance (100%), thereby exposing fundamental limitations in high-level cognitive modeling and long-range causal reasoning.

📝 Abstract
Humans are known to have an internal "world model" that enables us to carry out action planning based on world states. AI agents need to have such a world model for action planning as well. It is not clear how current AI models, especially generative models, are able to learn such world models and carry out procedural planning in diverse environments. We introduce WorldPrediction, a video-based benchmark for evaluating world modeling and procedural planning capabilities of different AI models. In contrast to prior benchmarks that focus primarily on low-level world modeling and robotic motion planning, WorldPrediction is the first benchmark that emphasizes actions with temporal and semantic abstraction. Given initial and final world states, the task is to distinguish the proper action (WorldPrediction-WM) or the properly ordered sequence of actions (WorldPrediction-PP) from a set of counterfactual distractors. This discriminative task setup enables us to evaluate different types of world models and planners and realize a thorough comparison across different hypotheses. The benchmark represents states and actions using visual observations. In order to prevent models from exploiting low-level continuity cues in background scenes, we provide "action equivalents" - identical actions observed in different contexts - as candidates for selection. The benchmark is grounded in a formal framework of partially observable semi-MDP, ensuring better reliability and robustness of the evaluation. We conduct extensive human filtering and validation on our benchmark and show that current frontier models barely achieve 57% accuracy on WorldPrediction-WM and 38% on WorldPrediction-PP, whereas humans are able to solve both tasks perfectly.
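The discriminative setup described above (given initial and final world states, pick the correct action from counterfactual distractors) can be sketched as a simple multiple-choice evaluation harness. This is a hypothetical illustration, not the released benchmark's actual schema or API: the `WMItem` fields and the `score_fn` interface are assumptions made for concreteness.

```python
from dataclasses import dataclass

@dataclass
class WMItem:
    """One hypothetical WorldPrediction-WM question: observations of the
    initial and final world states, candidate action clips (including
    'action equivalents' recorded in different backgrounds), and the
    index of the correct candidate."""
    initial_state: str      # e.g., path to an initial-state frame/clip
    final_state: str        # e.g., path to a final-state frame/clip
    candidates: list[str]   # candidate action clips
    answer: int             # index of the correct action

def evaluate(items: list[WMItem], score_fn) -> float:
    """Score each candidate action given the state pair; the model's
    choice is the argmax. Returns overall accuracy."""
    correct = 0
    for item in items:
        scores = [score_fn(item.initial_state, item.final_state, c)
                  for c in item.candidates]
        pred = max(range(len(scores)), key=scores.__getitem__)
        correct += int(pred == item.answer)
    return correct / len(items)
```

Under this framing, a model is any `score_fn(initial, final, action) -> float`, which lets heterogeneous world models and planners be compared on the same discriminative task, as the abstract argues.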
Problem

Research questions and friction points this paper is trying to address.

Evaluating AI world modeling and procedural planning capabilities
Distinguishing proper actions from counterfactuals in diverse environments
Assessing temporal and semantic abstraction in action sequences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Video-based benchmark for world modeling
Discriminative task with counterfactual distractors
Action equivalents to prevent low-level cues
Delong Chen
Meta FAIR Paris, The Hong Kong University of Science and Technology
Willy Chung
Meta FAIR Paris, ISIR Sorbonne Université
Yejin Bang
Ph.D. Candidate, HKUST
LLM Evaluation · NLP · Responsible AI
Ziwei Ji
Meta FAIR Paris, The Hong Kong University of Science and Technology
Pascale Fung
Dept. of Electronic & Computer Engineering, the Hong Kong University of Science & Technology
artificial intelligence · conversational AI · speech recognition · natural language processing · AI