Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or True Temporal Understanding?

📅 2025-05-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current video understanding benchmarks conflate temporal reasoning with static visual and linguistic priors, inflating estimates of video large language models' (video LLMs') dynamic comprehension. To address this, the paper proposes VBenchComp, an automated pipeline that decouples evaluation by categorizing benchmark questions into LLM-Answerable (answerable from language priors alone, without the video), Semantic (still answerable when frames are temporally shuffled), and Temporal (requiring the correct frame order), with remaining questions labeled Others. Experiments show that state-of-the-art video LLMs lag substantially on Temporal questions, and that conventional holistic scores overestimate their temporal competence. VBenchComp thus enables fine-grained, reproducible diagnosis of video LLM capabilities and offers guidance for designing benchmarks that more accurately measure temporal understanding.

📝 Abstract
Existing video understanding benchmarks often conflate knowledge-based and purely image-based questions, rather than clearly isolating a model's temporal reasoning ability, which is the key aspect that distinguishes video understanding from other modalities. We identify two major limitations that obscure whether higher scores truly indicate stronger understanding of the dynamic content in videos: (1) strong language priors, where models can answer questions without watching the video; and (2) shuffling invariance, where models maintain similar performance on certain questions even when video frames are temporally shuffled. To alleviate these issues, we propose VBenchComp, an automated pipeline that categorizes questions into different domains: LLM-Answerable, Semantic, and Temporal. Specifically, LLM-Answerable questions can be answered without viewing the video; Semantic questions remain answerable even when the video frames are shuffled; and Temporal questions require understanding the correct temporal order of frames. The rest of the questions are labeled as Others. This can enable fine-grained evaluation of different capabilities of a video LLM. Our analysis reveals nuanced model weaknesses that are hidden by traditional overall scores, and we offer insights and recommendations for designing future benchmarks that more accurately assess video LLMs.
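The categorization protocol described in the abstract can be sketched as a simple decision cascade. The following is a minimal illustration, not the paper's implementation: it assumes a model's answers have already been collected under three conditions (question only, shuffled frames, ordered frames), and all function and variable names are hypothetical.

```python
# Hypothetical sketch of a VBenchComp-style question categorizer.
# Assumes three model answers per question, gathered under different
# input conditions (names are illustrative, not from the paper):
#   answer_blind    - model saw only the question text, no video
#   answer_shuffled - model saw temporally shuffled video frames
#   answer_ordered  - model saw frames in their correct order

def categorize_question(answer_blind: str,
                        answer_shuffled: str,
                        answer_ordered: str,
                        ground_truth: str) -> str:
    """Assign a question to one of the four VBenchComp categories."""
    def correct(answer: str) -> bool:
        return answer.strip().lower() == ground_truth.strip().lower()

    if correct(answer_blind):
        return "LLM-Answerable"  # solvable via language priors alone
    if correct(answer_shuffled):
        return "Semantic"        # frame content suffices; order is irrelevant
    if correct(answer_ordered):
        return "Temporal"        # correct only with intact temporal order
    return "Others"              # model fails under all conditions

# Example: a question the model solves only with correctly ordered frames
print(categorize_question("B", "C", "A", ground_truth="A"))  # -> Temporal
```

Only questions that survive both the blind and shuffled checks, yet are answered correctly with ordered frames, count as evidence of temporal reasoning, which is the disentanglement the paper argues holistic scores lack.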
Problem

Research questions and friction points this paper is trying to address.

Existing benchmarks mix knowledge and image questions, obscuring temporal reasoning.
Models exploit language priors and shuffling invariance, skewing true video understanding.
Without decoupling question types, benchmark scores cannot be attributed to genuine temporal reasoning.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated pipeline categorizes video questions
Isolates temporal reasoning from knowledge-based questions
Fine-grained evaluation of video LLM capabilities