🤖 AI Summary
The zero-shot reasoning capabilities emerging in current generative video models have not yet been evaluated systematically.
Method: We introduce the first multidimensional, verifiable, and reproducible video reasoning benchmark, covering structured problem-solving, spatial cognition, pattern-based inference, and physical dynamics. Tasks are built from a mix of synthetic and real-world image sequences so that each has an unambiguous, answer-verifiable definition. We further analyze Chain-of-Frames reasoning, quantifying how video duration (temporal sequence length) affects reasoning performance.
Results: Evaluating six state-of-the-art video generation models reveals clear dimension-wise capability differences and pervasive hallucination patterns, particularly in physical and spatial reasoning. Our benchmark provides an empirically grounded, scalable framework for probing model reasoning mechanisms and advancing human-aligned video understanding.
📝 Abstract
Recent generative video models, such as Veo-3, have shown surprising zero-shot reasoning abilities, creating a growing need for systematic and reliable evaluation. We introduce V-ReasonBench, a benchmark designed to assess video reasoning across four key dimensions: structured problem-solving, spatial cognition, pattern-based inference, and physical dynamics. The benchmark is built from both synthetic and real-world image sequences and provides a diverse set of answer-verifiable tasks that are reproducible, scalable, and unambiguous. Evaluations of six state-of-the-art video models reveal clear dimension-wise differences, with strong variation in structured, spatial, pattern-based, and physical reasoning. We further compare video models with strong image models, analyze common hallucination behaviors, and study how video duration affects Chain-of-Frames reasoning. Overall, V-ReasonBench offers a unified and reproducible framework for measuring video reasoning and aims to support the development of models with more reliable, human-aligned reasoning skills.
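To make the evaluation protocol concrete, below is a minimal sketch of how an answer-verifiable benchmark of this kind can be scored, including a frame-budget sweep in the spirit of the Chain-of-Frames duration analysis. All names here (`Task`, `generate_video`, `read_answer`, `duration_sweep`) are hypothetical scaffolding for illustration, not the paper's actual API or released code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass(frozen=True)
class Task:
    prompt: str        # task instruction plus initial image sequence
    ground_truth: str  # single unambiguous answer, enabling automatic scoring

def accuracy(tasks: List[Task],
             generate_video: Callable[[str, int], List[str]],
             read_answer: Callable[[List[str]], str],
             num_frames: int) -> float:
    """Generate one video per task, read the answer off the output,
    and score exact-match accuracy against the ground truth."""
    hits = sum(
        read_answer(generate_video(t.prompt, num_frames)) == t.ground_truth
        for t in tasks
    )
    return hits / len(tasks)

def duration_sweep(tasks: List[Task],
                   generate_video: Callable[[str, int], List[str]],
                   read_answer: Callable[[List[str]], str],
                   lengths: Tuple[int, ...] = (16, 32, 64)) -> Dict[int, float]:
    """Re-run the benchmark at several frame budgets to probe how video
    duration affects Chain-of-Frames reasoning accuracy."""
    return {n: accuracy(tasks, generate_video, read_answer, n) for n in lengths}

if __name__ == "__main__":
    # Toy stand-ins: a "video" is a list of frame strings, and the answer
    # is whatever text appears on the final frame (a real harness would
    # use OCR or a structured parser here).
    tasks = [Task("2 + 3 = ?", "5"), Task("next in A B A B ?", "A")]
    fake_model = lambda prompt, n: ["frame"] * (n - 1) + ["5" if "+" in prompt else "A"]
    last_frame = lambda frames: frames[-1]
    print(duration_sweep(tasks, fake_model, last_frame))  # {16: 1.0, 32: 1.0, 64: 1.0}
```

The design point this sketch illustrates is that each task carries a single machine-checkable answer, so scoring requires no human judgment and results remain reproducible and scalable across models and frame budgets.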