🤖 AI Summary
Existing video understanding benchmarks emphasize short-range frame matching and fail to assess models' deeper multimodal capabilities, particularly long-range multi-frame localization, implicit causal reasoning, and reliable judgment. To address this gap, we propose MMR-V, the first benchmark dedicated to multimodal deep reasoning over videos. It incorporates (1) long-range multi-frame evidence tracing, (2) beyond-perception question design, (3) manual annotations aligned with extensive real-world user understanding, and (4) shortcut-resistant distractor annotations. Comprising 317 videos and 1,257 tasks, MMR-V covers challenging scenarios including implicit information inference, cross-frame logical reasoning, and causal reasoning. Experiments show that even the best-performing model, o4-mini, achieves only 52.5% accuracy, and that conventional chain-of-thought prompting and test-time compute scaling yield only marginal gains, demonstrating both the tasks' inherent difficulty and MMR-V's discriminative power for evaluating true multimodal reasoning.
📝 Abstract
The sequential structure of videos poses a challenge to the ability of multimodal large language models (MLLMs) to locate multi-frame evidence and conduct multimodal reasoning. However, existing video benchmarks mainly focus on understanding tasks, which only require models to match frames mentioned in the question (hereafter referred to as "question frames") and perceive a few adjacent frames. To address this gap, we propose MMR-V: A Benchmark for Multimodal Deep Reasoning in Videos. The benchmark is characterized by the following features. (1) Long-range, multi-frame reasoning: models must infer and analyze evidence frames that may be far from the question frame. (2) Beyond perception: questions cannot be answered through direct perception alone but require reasoning over hidden information. (3) Reliability: all tasks are manually annotated with reference to extensive real-world user interpretations to align with common perceptions. (4) Confusability: carefully designed distractor annotation strategies reduce model shortcuts. MMR-V consists of 317 videos and 1,257 tasks. Our experiments reveal that current models still struggle with multimodal reasoning; even the best-performing model, o4-mini, achieves only 52.5% accuracy. Moreover, current reasoning enhancement strategies (chain-of-thought prompting and scaling test-time compute) bring only limited gains. Further analysis indicates that the chain of thought (CoT) demanded for multimodal reasoning differs from that in textual reasoning, which partly explains the limited performance gains. We hope that MMR-V can inspire further research into enhancing multimodal reasoning capabilities.