🤖 AI Summary
Existing video understanding benchmarks evaluate only final answers, providing no supervision over intermediate reasoning steps, and thus cannot assess whether models genuinely integrate temporal and visual information. Method: The authors introduce MINERVA, a video reasoning benchmark in which each multiple-choice question (five answer options) is paired with a detailed, hand-crafted reasoning trace; the dataset spans diverse video domains and lengths and consists of complex multi-step questions. They perform fine-grained error analysis across models, build a taxonomy of video reasoning errors, and compare human and LLM-as-a-judge methods for scoring reasoning traces. Results: The benchmark challenges frontier open-source and proprietary models, whose failures stem primarily from temporal localization, followed by visual perception errors, rather than logical or completeness errors. The dataset, including questions, answer candidates, and reasoning traces, is publicly released.
📝 Abstract
Multimodal LLMs are turning their focus to video benchmarks; however, most video benchmarks provide only outcome supervision, with no intermediate or interpretable reasoning steps. This makes it challenging to assess whether models truly combine perceptual and temporal information to reason about videos, or simply arrive at the correct answer by chance or by exploiting linguistic biases. To remedy this, we provide a new video reasoning dataset called MINERVA for modern multimodal models. Each question in the dataset comes with five answer choices, as well as a detailed, hand-crafted reasoning trace. Our dataset is multimodal, diverse in terms of video domain and length, and consists of complex multi-step questions. Extensive benchmarking shows that our dataset poses a challenge for frontier open-source and proprietary models. We perform fine-grained error analysis to identify common failure modes across various models, and create a taxonomy of reasoning errors. We use this taxonomy to explore both human and LLM-as-a-judge methods for scoring video reasoning traces, and find that failure modes are primarily related to temporal localization, followed by visual perception errors, as opposed to logical or completeness errors. The dataset, along with questions, answer candidates, and reasoning traces, will be publicly available at https://github.com/google-deepmind/neptune?tab=readme-ov-file#minerva.