World Reasoning Arena

πŸ“… 2026-03-26
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Current world model benchmarks overemphasize next-state prediction and visual fidelity while neglecting the integrative simulation capabilities essential for intelligent behavior. To address this gap, this work proposes WR-Arena, a benchmark that establishes the first evaluation framework spanning three dimensions: action simulation fidelity, long-horizon prediction, and simulation-based reasoning for planning. The framework is supported by technical components including multi-step instruction execution, counterfactual trajectory generation, and modeling of long-term physical consistency, enabling systematic assessment of models' hypothetical reasoning and goal-directed simulation abilities in both structured and open-ended environments. Experiments reveal that existing world models substantially underperform human-level capabilities in complex simulation-based reasoning, demonstrating that WR-Arena can serve as both a diagnostic tool and an evaluation standard for the development of next-generation world models.
πŸ“ Abstract
World models (WMs) are intended to serve as internal simulators of the real world that enable agents to understand, anticipate, and act upon complex environments. Existing WM benchmarks remain narrowly focused on next-state prediction and visual fidelity, overlooking the richer simulation capabilities required for intelligent behavior. To address this gap, we introduce WR-Arena, a comprehensive benchmark for evaluating WMs along three fundamental dimensions of next world simulation: (i) Action Simulation Fidelity, the ability to interpret and follow semantically meaningful, multi-step instructions and generate diverse counterfactual rollouts; (ii) Long-horizon Forecast, the ability to sustain accurate, coherent, and physically plausible simulations across extended interactions; and (iii) Simulative Reasoning and Planning, the ability to support goal-directed reasoning by simulating, comparing, and selecting among alternative futures in both structured and open-ended environments. We build a task taxonomy and curate diverse datasets designed to probe these capabilities, moving beyond single-turn and perceptual evaluations. Through extensive experiments with state-of-the-art WMs, our results expose a substantial gap between current models and human-level hypothetical reasoning, and establish WR-Arena as both a diagnostic tool and a guideline for advancing next-generation world models capable of robust understanding, forecasting, and purposeful action. The code is available at https://github.com/MBZUAI-IFM/WR-Arena.
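The third dimension, simulative reasoning and planning, amounts to rolling out candidate action sequences in the world model, scoring the resulting futures, and selecting the best. A minimal sketch of that loop follows; the function names, the toy 1-D world, and the scoring interface are illustrative assumptions, not the benchmark's actual API:

```python
from typing import Any, Callable, Optional, Sequence

def plan_by_simulation(
    init_state: Any,
    candidate_plans: Sequence[Sequence[str]],
    simulate: Callable[[Any, str], Any],  # hypothetical WM step: (state, action) -> next state
    score: Callable[[Any], float],        # goal-directed score of a terminal state
) -> Optional[Sequence[str]]:
    """Roll out each candidate plan in the world model and return the best-scoring one."""
    best_plan, best_score = None, float("-inf")
    for plan in candidate_plans:
        state = init_state
        for action in plan:
            # The world model acts as an internal simulator of the environment.
            state = simulate(state, action)
        plan_score = score(state)
        if plan_score > best_score:
            best_plan, best_score = plan, plan_score
    return best_plan

# Toy usage: a 1-D world where the agent starts at 0 and the goal is position 2.
step = lambda s, a: s + 1 if a == "right" else s - 1
goal_score = lambda s: -abs(2 - s)
best = plan_by_simulation(0, [["right", "right"], ["left"]], step, goal_score)
# best -> ["right", "right"]
```

In the benchmark's terms, each rollout exercises action simulation fidelity and long-horizon consistency, while the comparison across futures exercises simulative reasoning.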
Problem

Research questions and friction points this paper is trying to address.

World Models
Simulation Fidelity
Long-horizon Forecast
Simulative Reasoning
Benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

World Models
Simulation Fidelity
Long-horizon Forecast
Counterfactual Reasoning
Goal-directed Planning
πŸ”Ž Similar Papers
No similar papers found.