CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models

📅 2026-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
It remains unclear whether current video prediction models learn the causal structure of the physical world or merely exploit superficial visual correlations. To address this, the work proposes CRONOS, an intervention-based benchmark for counterfactual physical consistency. Built in a photorealistic Unreal Engine environment, it applies controlled interventions on viewpoint, scene, object category, and object appearance while holding the physical event type (such as a collision, occlusion, or fall) fixed. An evaluation of recent open-source video generators shows that prediction quality for the same event type changes with appearance, environment, and especially viewpoint. CRONOS thus offers a controlled, reproducible testbed for diagnosing these failures and a concrete target for models that perform consistently across conditions.
📝 Abstract
Video prediction is increasingly viewed as a path toward generalizable world models, yet it remains unclear whether these systems learn underlying causal structure or merely exploit superficial visual correlations for future prediction. We introduce CRONOS, an intervention-based benchmark designed to evaluate counterfactual physical consistency: whether a model's predictions of physical events respond appropriately to controlled changes in the visual input, such as variations in scene context, viewpoint, object appearance, and object category. Built in a photorealistic Unreal Engine environment, CRONOS enables controlled, high-fidelity generation of videos across diverse scenes and dynamics. In contrast to previous benchmarks, CRONOS systematically intervenes on four key factors (viewpoint, scene, object category, and object appearance) while keeping the underlying physical event type, such as a collision, occlusion, or fall, fixed. Our evaluation of recent open-source video generators reveals substantial failures in counterfactual physical consistency: prediction quality for the same physical event type is affected by appearance, environment, and, particularly, by viewpoint changes. CRONOS provides a controlled and reproducible testbed for diagnosing how the quality of generated videos changes under different interventions, establishing a concrete target for developing models that perform consistently across changes in multiple conditions. The dataset and code are available at our project page.
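To make the evaluation idea concrete, here is a minimal Python sketch of how one might compare prediction quality across interventions while the event type stays fixed. It is an illustration only, not the CRONOS code: the `Sample` record, the `consistency_gaps` function, the idea of a scalar quality score, and the notion of a "baseline" condition are all assumptions introduced here, and the numbers in the usage lines are made-up placeholders rather than results from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Optional

# The four intervention factors named in the abstract.
FACTORS = ("viewpoint", "scene", "object_category", "object_appearance")


@dataclass(frozen=True)
class Sample:
    """One scored video prediction (hypothetical record layout)."""
    event_type: str          # e.g. "collision", "occlusion", "fall"
    factor: Optional[str]    # intervened factor, or None for the baseline
    score: float             # assumed quality score, higher is better


def consistency_gaps(samples):
    """Return {(event_type, factor): baseline mean - intervened mean}.

    A gap near zero means quality is stable under that intervention;
    a large positive gap means quality drops when the factor changes.
    """
    grouped = defaultdict(list)
    for s in samples:
        grouped[(s.event_type, s.factor)].append(s.score)

    gaps = {}
    for (event, factor), scores in grouped.items():
        if factor is None:
            continue  # the baseline itself has no gap
        baseline = grouped.get((event, None))
        if not baseline:
            continue  # no baseline to compare against
        gaps[(event, factor)] = mean(baseline) - mean(scores)
    return gaps


# Placeholder scores purely for illustration.
demo = [
    Sample("collision", None, 1.0),
    Sample("collision", None, 0.5),
    Sample("collision", "viewpoint", 0.25),
    Sample("collision", "viewpoint", 0.25),
    Sample("collision", "scene", 0.75),
]
for key, gap in sorted(consistency_gaps(demo).items()):
    print(key, round(gap, 3))
```

Grouping by event type before comparing mirrors the benchmark's premise that the physical event is held fixed, so any gap is attributable to the intervened factor rather than to a change in the underlying dynamics.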
Problem

Research questions and friction points this paper is trying to address.

counterfactual physical consistency
video prediction
causal structure
intervention-based benchmark
world models
Innovation

Methods, ideas, or system contributions that make the work stand out.

counterfactual physical consistency
intervention-based benchmark
video prediction
causal reasoning
Unreal Engine simulation