WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of systematic evaluation of world-state prediction in video generation models, a gap that makes it hard to assess whether generated videos remain plausible along physical, social, logical, and informational dimensions. The authors propose WorldReasonBench, a benchmark that reframes video generation evaluation as predicting how a world evolves from an initial state and an action. It comprises 436 curated test cases with structured ground-truth QA annotations spanning four reasoning dimensions and 22 subcategories. A companion preference benchmark, WorldRewardBench, provides approximately 6K expert-annotated preference pairs over 1.4K videos and supports both pointwise and pairwise reward-model evaluation. Evaluation combines Process-aware Reasoning Verification (structured QA and reasoning-phase diagnostics) with Multi-dimensional Quality Assessment covering reasoning quality, temporal consistency, and visual aesthetics. Experiments show that current models often produce visually convincing outputs that nonetheless fail at world reasoning, particularly in dynamics, causality, and information preservation.

📝 Abstract

Commercial video generation systems such as Seedance2.0 and Veo3.1 have rapidly improved, strengthening the view that video generators may be evolving into "world simulators." Yet the community still lacks a benchmark that directly tests whether a model can reason about how an observed world should evolve over time. We introduce WorldReasonBench, which reframes video generation evaluation as world-state prediction: given an initial state and an action, can a model generate a future video whose state evolution remains physically, socially, logically, and informationally consistent? WorldReasonBench contains 436 curated test cases with structured ground-truth QA annotations spanning four reasoning dimensions and 22 subcategories. We evaluate generated videos with a human-aligned two-part methodology: Process-aware Reasoning Verification uses structured QA and reasoning-phase diagnostics to detect temporal and causal failures, while Multi-dimensional Quality Assessment scores reasoning quality, temporal consistency, and visual aesthetics for ranking and reward modeling. We further introduce WorldRewardBench, a preference benchmark with approximately 6K expert-annotated pairs over 1.4K videos, supporting pair-wise and point-wise reward-model evaluation. Across modern video generators, our results expose a persistent gap between visual plausibility and world reasoning: videos can look convincing while failing dynamics, causality, or information preservation. We will release our benchmarks and evaluation toolkit to support community research on genuinely world-aware video generation at https://github.com/UniX-AI-Lab/WorldReasonBench/.
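The pair-wise reward-model evaluation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the metric or data format, so everything here is an assumption for illustration: the `PreferencePair` record, the `pairwise_accuracy` function, the tie-handling rule, and the toy scores are hypothetical and are not taken from the WorldRewardBench toolkit. The sketch computes how often a scoring function agrees with the expert-preferred video in each pair.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class PreferencePair:
    """Hypothetical record: two video IDs and the expert-preferred side."""
    video_a: str
    video_b: str
    preferred: str  # "a" or "b"


def pairwise_accuracy(
    pairs: Sequence[PreferencePair],
    score: Callable[[str], float],
) -> float:
    """Fraction of pairs where the higher-scored video is the preferred one.

    Ties are counted as disagreements (an assumption; other conventions,
    such as half credit, are equally plausible).
    """
    if not pairs:
        raise ValueError("pairs must be non-empty")
    correct = 0
    for pair in pairs:
        score_a = score(pair.video_a)
        score_b = score(pair.video_b)
        if score_a > score_b:
            predicted = "a"
        elif score_b > score_a:
            predicted = "b"
        else:
            predicted = None  # tie: never matches "a" or "b"
        if predicted == pair.preferred:
            correct += 1
    return correct / len(pairs)


# Toy data, invented purely for illustration.
example_scores = {"v1": 0.9, "v2": 0.4, "v3": 0.2, "v4": 0.7}
example_pairs = [
    PreferencePair("v1", "v2", "a"),  # model agrees (0.9 > 0.4)
    PreferencePair("v3", "v4", "b"),  # model agrees (0.7 > 0.2)
    PreferencePair("v2", "v4", "a"),  # model disagrees (0.4 < 0.7)
]
example_accuracy = pairwise_accuracy(example_pairs, example_scores.__getitem__)
```

On the toy data the scoring function agrees with two of the three preferences. A point-wise evaluation would instead compare each score against a per-video human rating, which this sketch does not cover.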

Problem

Research questions and friction points this paper is trying to address.

video generation

world-state prediction

reasoning benchmark

temporal consistency

causal reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

world-state prediction

reasoning benchmark

video generation evaluation