Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

📅 2026-04-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video understanding benchmarks inadequately assess model robustness and reasoning faithfulness. To address this limitation, this work introduces Video-MME-v2, a challenging benchmark built on a progressive three-level task hierarchy and a group-based nonlinear scoring mechanism, measuring model capabilities from visual information aggregation through temporal dynamics modeling to complex multimodal reasoning. The dataset was constructed through 3,300 person-hours of human annotation, up to five rounds of quality control, and the collaborative efforts of 12 annotators and 50 reviewers, with an emphasis on reasoning consistency and coherence. Experimental results reveal a substantial performance gap between current state-of-the-art models, such as Gemini-3-Pro, and human experts; they show that low-level perceptual errors propagate to constrain high-level reasoning, and they highlight the critical role of textual cues in chain-of-thought reasoning.
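The hierarchical-bottleneck finding can be probed with a simple conditional-accuracy analysis: compare high-level reasoning accuracy on videos where the model's lower-level answers were correct against videos where they were not. The sketch below is a minimal illustration, not the paper's evaluation code; the per-video result schema (boolean l1/l2/l3 flags) is an assumption made for the example.

```python
def conditional_accuracy(results):
    """Estimate how low-level errors cap high-level reasoning.

    `results` maps video_id -> {"l1": bool, "l2": bool, "l3": bool},
    where each flag means all questions at that level were answered
    correctly. The schema is illustrative, not the paper's format.
    Returns Level-3 accuracy for videos with a solid Level-1/Level-2
    base versus videos without one.
    """
    solid, shaky = [], []
    for flags in results.values():
        # Split videos by whether the perceptual/temporal base held up.
        (solid if flags["l1"] and flags["l2"] else shaky).append(flags["l3"])

    rate = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    return rate(solid), rate(shaky)
```

A large gap between the two returned rates would indicate, as the summary describes, that errors at the aggregation and temporal levels propagate upward to limit reasoning.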
📝 Abstract
With the rapid advancement of video understanding, existing benchmarks are becoming increasingly saturated, exposing a critical discrepancy between inflated leaderboard scores and real-world model capabilities. To address this widening gap, we introduce Video-MME-v2, a comprehensive benchmark designed to rigorously evaluate the robustness and faithfulness of video understanding. To systematically evaluate model capabilities, we design a progressive tri-level hierarchy that incrementally increases the complexity of video comprehension, ranging from multi-point visual information aggregation, to temporal dynamics modeling, and ultimately to complex multimodal reasoning. In addition, in contrast to conventional per-question accuracy, we propose a group-based non-linear evaluation strategy that enforces both consistency across related queries and coherence in multi-step reasoning. It penalizes fragmented or guess-based correctness and assigns credit only to answers supported by valid reasoning. To guarantee data quality, Video-MME-v2 is constructed through a rigorously controlled human annotation pipeline involving 12 annotators and 50 independent reviewers. Backed by 3,300 human-hours and up to 5 rounds of quality assurance, Video-MME-v2 aims to serve as one of the most authoritative video benchmarks. Extensive experiments reveal a substantial gap between the current best model, Gemini-3-Pro, and human experts, and uncover a clear hierarchical bottleneck where errors in visual information aggregation and temporal modeling propagate to limit high-level reasoning. We further find that thinking-based reasoning is highly dependent on textual cues, improving performance with subtitles but sometimes degrading it in purely visual settings. By exposing these limitations, Video-MME-v2 establishes a demanding new testbed for the development of next-generation video MLLMs.
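The abstract does not give the exact scoring formula, but the group-based non-linear idea (credit awarded only when an entire group of related questions is answered correctly, each with valid reasoning) can be sketched as follows. The record schema and field names are illustrative assumptions, not the paper's data format.

```python
from collections import defaultdict

def group_based_score(records):
    """Group-based non-linear scoring sketch.

    A question group earns credit only if every question in it is
    answered correctly AND each answer is backed by valid reasoning,
    so fragmented or guess-based correctness within a group scores zero.

    `records` is a list of dicts with keys 'group_id', 'correct' (bool),
    and 'reasoning_valid' (bool) -- field names are illustrative.
    Returns the fraction of groups receiving credit.
    """
    groups = defaultdict(list)
    for r in records:
        groups[r["group_id"]].append(r)

    credited = 0
    for members in groups.values():
        # Consistency: all related queries must be correct.
        # Coherence: each answer must be supported by valid reasoning.
        if all(m["correct"] and m["reasoning_valid"] for m in members):
            credited += 1
    return credited / len(groups) if groups else 0.0
```

For example:

```python
records = [
    {"group_id": "g1", "correct": True, "reasoning_valid": True},
    {"group_id": "g1", "correct": True, "reasoning_valid": False},
    {"group_id": "g2", "correct": True, "reasoning_valid": True},
]
print(group_based_score(records))  # 0.5: g1 loses all credit despite two correct answers
```

The non-linearity is the all-or-nothing rule at the group level: per-question accuracy would reward g1 partially, whereas this scheme does not.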
Problem

Research questions and friction points this paper is trying to address.

video understanding
benchmark
multimodal reasoning
model evaluation
temporal dynamics
Innovation

Methods, ideas, or system contributions that make the work stand out.

progressive tri-level hierarchy
group-based non-linear evaluation
video understanding benchmark
multimodal reasoning
human annotation pipeline
🔎 Similar Papers
2024-02-20 · International Conference on Machine Learning · Citations: 30