Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

📅 2026-04-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video understanding benchmarks inadequately assess model robustness and reasoning faithfulness. To address this limitation, this work introduces Video-MME-v2, a challenging benchmark built on a progressive three-level task hierarchy and a group-based nonlinear scoring mechanism, measuring model capabilities from visual information aggregation through temporal dynamics modeling to complex multimodal reasoning. The dataset was constructed through 3,300 person-hours of human annotation, up to five rounds of quality control, and the collaborative efforts of 12 annotators and 50 reviewers, with an emphasis on reasoning consistency and coherence. Experimental results reveal a substantial performance gap between current state-of-the-art models, such as Gemini-3-Pro, and human experts; they show that low-level perceptual errors propagate to constrain high-level reasoning, and they highlight the critical role of textual cues in chain-of-thought reasoning.
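The hierarchical-bottleneck finding can be probed with a simple conditional-accuracy analysis: compare high-level reasoning accuracy on videos where the model's lower-level answers were correct against videos where they were not. The sketch below is a minimal illustration, not the paper's evaluation code; the per-video result schema (boolean l1/l2/l3 flags) is an assumption made for the example.

```python
def conditional_accuracy(results):
    """Estimate how low-level errors cap high-level reasoning.

    `results` maps video_id -> {"l1": bool, "l2": bool, "l3": bool},
    where each flag means all questions at that level were answered
    correctly. The schema is illustrative, not the paper's format.
    Returns Level-3 accuracy for videos with a solid Level-1/Level-2
    base versus videos without one.
    """
    solid, shaky = [], []
    for flags in results.values():
        # Split videos by whether the perceptual/temporal base held up.
        (solid if flags["l1"] and flags["l2"] else shaky).append(flags["l3"])

    rate = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    return rate(solid), rate(shaky)
```

A large gap between the two returned rates would indicate, as the summary describes, that errors at the aggregation and temporal levels propagate upward to limit reasoning.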
📝 Abstract
With the rapid advancement of video understanding, existing benchmarks are becoming increasingly saturated, exposing a critical discrepancy between inflated leaderboard scores and real-world model capabilities. To address this widening gap, we introduce Video-MME-v2, a comprehensive benchmark designed to rigorously evaluate the robustness and faithfulness of video understanding. To systematically evaluate model capabilities, we design a progressive tri-level hierarchy that incrementally increases the complexity of video comprehension, ranging from multi-point visual information aggregation, to temporal dynamics modeling, and ultimately to complex multimodal reasoning. In addition, in contrast to conventional per-question accuracy, we propose a group-based non-linear evaluation strategy that enforces both consistency across related queries and coherence in multi-step reasoning. It penalizes fragmented or guess-based correctness and assigns credit only to answers supported by valid reasoning. To guarantee data quality, Video-MME-v2 is constructed through a rigorously controlled human annotation pipeline involving 12 annotators and 50 independent reviewers. Backed by 3,300 human-hours and up to 5 rounds of quality assurance, Video-MME-v2 aims to serve as one of the most authoritative video benchmarks. Extensive experiments reveal a substantial gap between the current best model, Gemini-3-Pro, and human experts, and uncover a clear hierarchical bottleneck where errors in visual information aggregation and temporal modeling propagate to limit high-level reasoning. We further find that thinking-based reasoning is highly dependent on textual cues, improving performance with subtitles but sometimes degrading it in purely visual settings. By exposing these limitations, Video-MME-v2 establishes a demanding new testbed for the development of next-generation video MLLMs.
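The abstract does not give the exact scoring formula, but the group-based non-linear idea (credit awarded only when an entire group of related questions is answered correctly, each with valid reasoning) can be sketched as follows. The record schema and field names are illustrative assumptions, not the paper's data format.

```python
from collections import defaultdict

def group_based_score(records):
    """Group-based non-linear scoring sketch.

    A question group earns credit only if every question in it is
    answered correctly AND each answer is backed by valid reasoning,
    so fragmented or guess-based correctness within a group scores zero.

    `records` is a list of dicts with keys 'group_id', 'correct' (bool),
    and 'reasoning_valid' (bool) -- field names are illustrative.
    Returns the fraction of groups receiving credit.
    """
    groups = defaultdict(list)
    for r in records:
        groups[r["group_id"]].append(r)

    credited = 0
    for members in groups.values():
        # Consistency: all related queries must be correct.
        # Coherence: each answer must be supported by valid reasoning.
        if all(m["correct"] and m["reasoning_valid"] for m in members):
            credited += 1
    return credited / len(groups) if groups else 0.0
```

For example:

```python
records = [
    {"group_id": "g1", "correct": True, "reasoning_valid": True},
    {"group_id": "g1", "correct": True, "reasoning_valid": False},
    {"group_id": "g2", "correct": True, "reasoning_valid": True},
]
print(group_based_score(records))  # 0.5: g1 loses all credit despite two correct answers
```

The non-linearity is the all-or-nothing rule at the group level: per-question accuracy would reward g1 partially, whereas this scheme does not.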
Problem

Research questions and friction points this paper is trying to address.

video understanding
benchmark
multimodal reasoning
model evaluation
temporal dynamics
Innovation

Methods, ideas, or system contributions that make the work stand out.

progressive tri-level hierarchy
group-based non-linear evaluation
video understanding benchmark
multimodal reasoning
human annotation pipeline
🔎 Similar Papers
2024-02-20 · International Conference on Machine Learning · Citations: 30