🤖 AI Summary
It remains unclear whether current video world models can accurately simulate the natural evolution of physical states, such as water flowing or ice melting, when those states are not being observed. To address this, this work proposes STEVO-Bench, a benchmark that systematically disentangles state evolution from the observation process. Using controlled interventions (inserting an occluder, turning off the light, and steering the camera along a "lookaway" trajectory), it evaluates whether models maintain coherent state dynamics while observations are missing. The benchmark pairs these interventions with an evaluation protocol that automatically detects and disentangles failure modes across key aspects of natural state evolution. Experiments on video models with and without camera control show that current models struggle to decouple state evolution from observation, and the analysis points to potential data and architecture biases. These findings offer diagnostic insight for developing more robust world models.
📝 Abstract
Evolutions in the world, such as water pouring or ice melting, happen regardless of being observed. Video world models generate "worlds" via 2D frame observations. Can these generated "worlds" evolve regardless of observation? To probe this question, we design a benchmark to evaluate whether video world models can decouple state evolution from observation. Our benchmark, STEVO-Bench, applies observation control to evolving processes via instructions to insert an occluder, turn off the light, or follow a specified camera "lookaway" trajectory. By evaluating video models with and without camera control on a diverse set of naturally occurring evolutions, we expose their limitations in decoupling state evolution from observation. STEVO-Bench proposes an evaluation protocol to automatically detect and disentangle failure modes of video world models across key aspects of natural state evolution. Analysis of STEVO-Bench results provides new insight into potential data and architecture biases of present-day video world models. Project website: https://glab-caltech.github.io/STEVOBench/. Blog: https://ziqi-ma.github.io/blog/2026/outofsight/