What-If World: A Causal Benchmark for General World Models in Embodied Scenarios

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a gap in how video generation models are evaluated as world simulators: conventional benchmarks score videos one at a time for plausibility, so they cannot detect whether a model's output changes correctly when a single physical detail of its input changes. The study proposes What-If World, a causal benchmark for embodied scenarios consisting of 319 prompt pairs built on real frames from nuScenes and DROID. Each pair differs in exactly one physical variable, drawn from a taxonomy of six variables shared across driving and manipulation. Pairs are scored with APEO, a four-part rubric covering Adherence, Physics, Environment, and Outcome. Across nine state-of-the-art models, no system exceeds 52% on the paired score, and open-source models cluster near 28%. Where models do score well, performance appears to track the visual prominence of the intervention rather than the tractability of its physics: some visually subtle interventions score as low as 14.2%, while visually pronounced ones reach 40.4%.

📝 Abstract

Video generation models are increasingly used as world simulators for tasks like driving and robotic manipulation. What matters in these settings is not whether a single video looks right, but whether the model's output changes when its input changes. We test this by giving a model two prompts describing the same scene with one physical detail varied, and checking whether the two videos diverge the way physics predicts. The wording difference between the prompts is small by design, since only one variable is changed, but the correct physical difference is not. A model that misses this can still produce two videos that each look plausible individually, and existing benchmarks score videos one at a time and cannot detect this failure. We introduce What-If World, 319 such prompt pairs built on real frames from nuScenes and DROID, organized by a taxonomy of six physical variables shared across driving and manipulation. Each pair is scored with APEO, a four-part rubric checking whether each video follows its prompt (Adherence), is physically consistent (Physics), preserves the shared scene (Environment), and ends in the correct difference (Outcome). Across nine state-of-the-art models, no system exceeds 52% on the paired score, and open-source models cluster near 28%. Every model tested fails on a large fraction of causal interventions, indicating substantial room before these models can reliably support action-conditioned simulation or model-based planning. Where models do score well, performance appears to track the visual prominence of the intervention rather than the tractability of its underlying physics. Some visually subtle interventions score as low as 14.2%, while visually pronounced ones reach 40.4%.
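
To make the difference between per-video and paired scoring concrete, here is a minimal Python sketch of how APEO verdicts might be aggregated into a paired score. The abstract does not specify the aggregation rule, so the strict "all checks must pass" rule, the boolean verdicts, and every name below (`PairJudgment`, `pair_passes`, `paired_score`, `per_video_score`) are assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class PairJudgment:
    """Hypothetical APEO verdicts for one prompt pair (videos A and B)."""
    adherence_a: bool   # video A follows prompt A
    adherence_b: bool   # video B follows prompt B
    physics_a: bool     # video A is physically consistent
    physics_b: bool     # video B is physically consistent
    environment: bool   # the shared scene is preserved across both videos
    outcome: bool       # the two videos end in the correct difference


def pair_passes(j: PairJudgment) -> bool:
    # Assumed rule: a pair counts only if every APEO check passes.
    return all((j.adherence_a, j.adherence_b,
                j.physics_a, j.physics_b,
                j.environment, j.outcome))


def paired_score(judgments: Iterable[PairJudgment]) -> float:
    """Fraction of pairs that pass every check."""
    items = list(judgments)
    if not items:
        raise ValueError("need at least one judged pair")
    return sum(pair_passes(j) for j in items) / len(items)


def per_video_score(judgments: Iterable[PairJudgment]) -> float:
    """Fraction of individual videos that look right on their own
    (adherence and physics only), ignoring the paired checks."""
    items = list(judgments)
    if not items:
        raise ValueError("need at least one judged pair")
    ok = sum((j.adherence_a and j.physics_a) + (j.adherence_b and j.physics_b)
             for j in items)
    return ok / (2 * len(items))


# Two pairs: one fully correct, one where both videos look fine
# individually but do not diverge the way physics predicts.
judged = [
    PairJudgment(True, True, True, True, True, True),
    PairJudgment(True, True, True, True, True, False),
]
print(per_video_score(judged), paired_score(judged))
```

In this toy case every individual video passes its own checks, yet only half of the pairs pass, which is the kind of failure that one-video-at-a-time scoring cannot surface.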

Problem

Research questions and friction points this paper is trying to address.

causal reasoning

world models

video generation

physical consistency

embodied scenarios

Innovation

Methods, ideas, or system contributions that make the work stand out.

causal benchmark

world models

video generation