🤖 AI Summary
Existing video understanding benchmarks emphasize short-range frame matching and fail to assess models' deeper multimodal capabilities, particularly long-range multi-frame localization, implicit causal reasoning, and reliable judgment. To address this gap, we propose MMR-V, the first benchmark dedicated to multimodal deep reasoning over videos. It incorporates (1) long-range multi-frame evidence tracing, (2) beyond-perception question design, (3) manual annotations aligned with extensive real-world user understanding, and (4) shortcut-resistant distractor annotations. Comprising 317 videos and 1,257 tasks, MMR-V covers challenging scenarios including implicit information inference, cross-frame logical reasoning, and causal reasoning. Experiments show that even the best-performing model, o4-mini, achieves only 52.5% accuracy, and that conventional chain-of-thought prompting and test-time compute scaling yield only marginal gains, demonstrating both the tasks' inherent difficulty and MMR-V's discriminative power for evaluating true multimodal reasoning.
📝 Abstract
The sequential structure of videos poses a challenge to the ability of multimodal large language models (MLLMs) to locate multi-frame evidence and conduct multimodal reasoning. However, existing video benchmarks mainly focus on understanding tasks, which only require models to match frames mentioned in the question (hereafter referred to as "question frames") and perceive a few adjacent frames. To address this gap, we propose MMR-V: A Benchmark for Multimodal Deep Reasoning in Videos. The benchmark is characterized by the following features. (1) Long-range, multi-frame reasoning: models must infer and analyze evidence frames that may be far from the question frame. (2) Beyond perception: questions cannot be answered through direct perception alone but require reasoning over hidden information. (3) Reliability: all tasks are manually annotated with reference to extensive real-world user interpretations to align with common perceptions. (4) Confusability: carefully designed distractor annotation strategies reduce model shortcuts. MMR-V consists of 317 videos and 1,257 tasks. Our experiments reveal that current models still struggle with multimodal reasoning; even the best-performing model, o4-mini, achieves only 52.5% accuracy. Moreover, current reasoning enhancement strategies (chain-of-thought prompting and scaling test-time compute) bring only limited gains. Further analysis indicates that the chain of thought (CoT) demanded for multimodal reasoning differs from that in textual reasoning, which partly explains the limited performance gains. We hope that MMR-V can inspire further research into enhancing multimodal reasoning capabilities.