VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation

📅 2025-05-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing long-video understanding (LVU) benchmarks suffer from two critical flaws: multiple-choice questions (MCQs) are highly susceptible to guessing, and many questions can be answered from strong priors rather than video content, leading to inflated and biased evaluation. Method: The paper introduces VideoEval-Pro, a realistic LVU benchmark that replaces MCQs with open-ended short-answer questions that genuinely require watching the video. VideoEval-Pro covers both segment-level and full-video understanding through perception and reasoning tasks, and is used to systematically evaluate 21 proprietary and open-source video LMMs. Contribution/Results: Models drop by more than 25% on open-ended questions compared with MCQs; higher MCQ scores do not translate into higher open-ended scores; and VideoEval-Pro benefits more than existing MCQ benchmarks from increasing the number of input frames, making it a more realistic and reliable measure of long-video understanding.

📝 Abstract
Large multimodal models (LMMs) have recently emerged as a powerful tool for long video understanding (LVU), prompting the development of standardized LVU benchmarks to evaluate their performance. However, our investigation reveals a rather sober lesson for existing LVU benchmarks. First, most existing benchmarks rely heavily on multiple-choice questions (MCQs), whose evaluation results are inflated due to the possibility of guessing the correct answer. Second, a significant portion of questions in these benchmarks have strong priors that allow models to answer directly without even reading the input video. For example, Gemini-1.5-Pro can achieve over 50% accuracy given a random frame from a long video on Video-MME. We also observe that increasing the number of frames does not necessarily lead to improvement on existing benchmarks, which is counterintuitive. As a result, the validity and robustness of current LVU benchmarks are undermined, impeding a faithful assessment of LMMs' long-video understanding capability. To tackle this problem, we propose VideoEval-Pro, a realistic LVU benchmark containing open-ended short-answer questions that truly require understanding the entire video. VideoEval-Pro assesses both segment-level and full-video understanding through perception and reasoning tasks. By evaluating 21 proprietary and open-source video LMMs, we reach the following findings: (1) video LMMs show drastic performance drops (>25%) on open-ended questions compared with MCQs; (2) surprisingly, higher MCQ scores do not lead to higher open-ended scores on VideoEval-Pro; (3) compared to other MCQ benchmarks, VideoEval-Pro benefits more from increasing the number of input frames. Our results show that VideoEval-Pro offers a more realistic and reliable measure of long video understanding, providing a clearer view of progress in this domain.
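As a minimal sketch of how the MCQ-versus-open-ended gap described in the abstract could be measured: the dataset fields (`gold_letter`, `gold_answer`), the model interface (`answer_mcq`, `answer_open`), and the `judge` callable below are illustrative assumptions, not the benchmark's actual API.

```python
# Hypothetical harness for measuring the MCQ-vs-open-ended accuracy gap.
# All interfaces here are assumptions for illustration, not VideoEval-Pro's real API.

def mcq_correct(prediction: str, gold_letter: str) -> bool:
    # MCQ scoring: match on the chosen option letter (answerable at chance level by guessing).
    return prediction.strip().upper().startswith(gold_letter.strip().upper())

def open_ended_correct(prediction: str, gold_answer: str, judge) -> bool:
    # Open-ended scoring: a text judge (e.g., an LLM grader) decides semantic equivalence.
    return judge(prediction=prediction, reference=gold_answer)

def accuracy_gap(examples, model, judge) -> float:
    """Return the accuracy drop (percentage points) when the same questions
    are posed as open-ended short answers instead of multiple choice."""
    mcq_acc = sum(
        mcq_correct(model.answer_mcq(ex["video"], ex["question"], ex["options"]), ex["gold_letter"])
        for ex in examples
    ) / len(examples)
    open_acc = sum(
        open_ended_correct(model.answer_open(ex["video"], ex["question"]), ex["gold_answer"], judge)
        for ex in examples
    ) / len(examples)
    return 100.0 * (mcq_acc - open_acc)
```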
Problem

Research questions and friction points this paper is trying to address.

Existing LVU benchmarks inflate performance due to guessable multiple-choice questions.
Many benchmark questions bypass video understanding via strong prior knowledge.
Current benchmarks lack validity in assessing true long-video comprehension.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-ended short-answer questions for realistic assessment
Segment-level and full-video understanding evaluation
Greater gains from increasing input frames, indicating genuine reliance on video content (see the sketch after this list)
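The last point suggests a simple frame-count ablation. Below is a minimal sketch under assumed interfaces (`sample_frames`, `model.answer`, `judge` are hypothetical placeholders): on a benchmark that truly requires watching the video, accuracy should rise as the frame budget grows.

```python
# Hypothetical frame-count ablation; all interfaces are illustrative assumptions.

def frame_ablation(model, examples, judge, frame_budgets=(8, 16, 32, 64, 128)):
    """Score the same open-ended questions while varying the number of input frames."""
    results = {}
    for n_frames in frame_budgets:
        correct = 0
        for ex in examples:
            frames = ex["sample_frames"](n_frames)             # e.g., uniformly sample n_frames from the video
            prediction = model.answer(frames, ex["question"])   # free-form short answer
            correct += int(judge(prediction=prediction, reference=ex["gold_answer"]))
        results[n_frames] = correct / len(examples)
    return results  # accuracy keyed by frame budget; expected to increase on a content-grounded benchmark
```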
🔎 Similar Papers