🤖 AI Summary
This study asks whether current video diffusion models genuinely understand causality or merely fit temporal statistical patterns, a question that has lacked effective evaluation on real-world video. The authors propose YoCausal, a two-level benchmark inspired by the "violation of expectation" paradigm from cognitive science, which temporally reverses real videos to obtain counterfactual samples at zero cost. They introduce the Reverse Surprise Index and the Causality Cognition Index to disentangle temporal bias from causal understanding, for the first time on natural videos. Applying this arbitrarily extensible evaluation protocol to 13 state-of-the-art video diffusion models shows that, while these models can perceive the arrow of time, their causal reasoning remains far below human level.
📝 Abstract
As video diffusion models (VDMs) advance toward world models, a key question arises: do they truly understand causality, or merely overfit to statistical temporal patterns? Existing benchmarks mostly rely on synthetic data, limiting real-world generalization due to the sim-to-real gap. We present YoCausal, a two-level benchmark inspired by the Violation of Expectation (VoE) paradigm from cognitive science. By temporally reversing real-world videos at zero cost as natural counterfactual samples, YoCausal establishes an arbitrarily extensible evaluation protocol. Level 1 introduces the Reverse Surprise Index (RSI), quantifying arrow-of-time perception via denoising loss. Level 2 introduces the Causality Cognition Index (CCI), which leverages a VLM to stratify datasets into causal and non-causal subsets, disentangling genuine causal reasoning from temporal bias. Evaluation of 13 state-of-the-art VDMs reveals that perceiving the arrow of time does not imply understanding causality, and a significant gap persists relative to human-level causal cognition.
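The two-level protocol described above can be illustrated with a rough sketch. The abstract does not give the formulas for RSI or CCI, so everything below is an assumption: `surprise_gap` is a hypothetical stand-in for a reversed-versus-forward denoising-loss comparison, `stratified_gap` is a hypothetical stand-in for comparing causal and non-causal subsets, and the loss values are placeholders chosen for illustration, not results from the paper. In practice the losses would come from a video diffusion model's denoising objective and the causal labels from a VLM.

```python
from statistics import mean


def reverse_clip(frames):
    """Return the frames in reversed temporal order (the counterfactual sample)."""
    return frames[::-1]


def surprise_gap(loss_forward, loss_reversed):
    """Hypothetical per-clip score, NOT the paper's RSI definition.

    Relative increase in denoising loss when the clip is played backwards.
    A positive value means the model finds the reversed clip more surprising.
    """
    return (loss_reversed - loss_forward) / loss_forward


def stratified_gap(records):
    """Average the per-clip gap separately for causal and non-causal clips.

    records: iterable of (loss_forward, loss_reversed, is_causal) tuples,
    where is_causal is the label assumed to come from a VLM.
    Returns (mean_gap_causal, mean_gap_non_causal).
    """
    causal = [surprise_gap(f, r) for f, r, c in records if c]
    non_causal = [surprise_gap(f, r) for f, r, c in records if not c]
    return mean(causal), mean(non_causal)


# Placeholder losses for four clips (illustrative only).
records = [
    (0.10, 0.15, True),
    (0.20, 0.26, True),
    (0.10, 0.11, False),
    (0.20, 0.20, False),
]
causal_gap, non_causal_gap = stratified_gap(records)
print(f"causal gap: {causal_gap:.3f}, non-causal gap: {non_causal_gap:.3f}")
```

Under these assumptions, a model that only tracks temporal statistics would show a similar gap on both subsets, while a larger gap on the causal subset would suggest sensitivity to causal structure beyond generic temporal bias.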