MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos

📅 2025-06-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video understanding benchmarks emphasize short-range frame matching, failing to assess models’ deep multimodal capabilities—particularly long-range multi-frame localization, implicit causal reasoning, and high-reliability judgment. To address this gap, we propose MMR-V, the first benchmark dedicated to multimodal deep reasoning over videos. It innovatively incorporates (1) long-range multi-frame evidence tracing, (2) perception-agnostic question design, (3) human-driven, real-user-aligned annotations, and (4) shortcut-resistant confounding annotations. Comprising 317 videos and 1,257 tasks, MMR-V covers challenging scenarios including implicit information inference, cross-frame logical reasoning, and causal reasoning. Experiments reveal that even the state-of-the-art model o4-mini achieves only 52.5% accuracy; conventional chain-of-thought prompting and computational scaling yield marginal gains—demonstrating both the task’s inherent difficulty and MMR-V’s strong discriminative power for evaluating true multimodal reasoning.

📝 Abstract
The sequential structure of videos poses a challenge to the ability of multimodal large language models (MLLMs) to locate multi-frame evidence and conduct multimodal reasoning. However, existing video benchmarks mainly focus on understanding tasks, which only require models to match frames mentioned in the question (hereafter referred to as the "question frame") and perceive a few adjacent frames. To address this gap, we propose MMR-V: A Benchmark for Multimodal Deep Reasoning in Videos. The benchmark is characterized by the following features. (1) Long-range, multi-frame reasoning: Models are required to infer and analyze evidence frames that may be far from the question frame. (2) Beyond perception: Questions cannot be answered through direct perception alone but require reasoning over hidden information. (3) Reliability: All tasks are manually annotated, referencing extensive real-world user understanding to align with common perceptions. (4) Confusability: Carefully designed distractor annotation strategies to reduce model shortcuts. MMR-V consists of 317 videos and 1,257 tasks. Our experiments reveal that current models still struggle with multi-modal reasoning; even the best-performing model, o4-mini, achieves only 52.5% accuracy. Additionally, current reasoning enhancement strategies (Chain-of-Thought and scaling test-time compute) bring limited gains. Further analysis indicates that the CoT required for multi-modal reasoning differs from that in textual reasoning, which partly explains the limited performance gains. We hope that MMR-V can inspire further research into enhancing multi-modal reasoning capabilities.
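The reported scores (e.g., o4-mini at 52.5%) are multiple-choice accuracy over the 1,257 tasks. A minimal sketch of how such a metric is computed; the task schema (`video_id`, `question`, `options`, `answer`) and the dummy predictor are illustrative assumptions, not the official MMR-V format:

```python
# Hypothetical sketch of multiple-choice accuracy scoring for a
# video-reasoning benchmark. Field names are illustrative only.

def accuracy(tasks, predict):
    """Fraction of tasks where the predicted option matches the gold answer."""
    correct = sum(1 for t in tasks if predict(t) == t["answer"])
    return correct / len(tasks)

# Toy example with a dummy predictor that always picks option "A".
tasks = [
    {"video_id": "v1", "question": "Why does the character hesitate?",
     "options": ["A", "B", "C", "D"], "answer": "A"},
    {"video_id": "v2", "question": "What earlier event explains the ending?",
     "options": ["A", "B", "C", "D"], "answer": "C"},
]
print(accuracy(tasks, lambda t: "A"))  # → 0.5
```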
Problem

Research questions and friction points this paper is trying to address.

Probes MLLMs' ability to locate multi-frame evidence for video reasoning
Addresses the lack of benchmarks for deep multimodal reasoning in videos
Evaluates models on long-range reasoning beyond direct perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

Long-range multi-frame video reasoning benchmark
Beyond perception reasoning with hidden information
Manually annotated tasks with distractor strategies
Authors

Kejian Zhu
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences

Zhuoran Jin
Institute of Automation, Chinese Academy of Sciences
Large Language Models · Natural Language Processing · Knowledge Engineering

Hongbang Yuan
Institute of Automation, Chinese Academy of Sciences
Large Language Models · Natural Language Processing

Jiachun Li
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences

Shangqing Tu
Tsinghua University, graduate student
Trustworthy AI · Large Language Model · AI for Education

Pengfei Cao
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences

Yubo Chen
Institute of Automation, Chinese Academy of Sciences
Natural Language Processing · Information Extraction · Event Extraction · Large Language Model

Kang Liu
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences

Jun Zhao
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences