AI Summary
Existing large multimodal models (LMMs) lack systematic evaluation for video quality understanding. Method: We introduce Q-Bench-Video, the first dedicated multimodal benchmark for this task. It covers three video sources: natural scenes, AI-generated content (AIGC), and computer graphics (CG). It includes four question types: Yes-or-No, What-How, open-ended, and video-pair quality comparison. It is also the first benchmark to incorporate AIGC-specific distortion dimensions. We propose a systematic evaluation framework featuring a multi-granularity QA design and cross-source sampling, validated via expert annotation to yield 2,378 high-quality question-answer pairs. Contribution/Results: Evaluation of 17 state-of-the-art LMMs (12 open-source and 5 proprietary) reveals a substantial gap between model and human performance in video quality understanding. Q-Bench-Video provides a reproducible benchmark for this task and identifies concrete directions for future improvement.
Abstract
With the rising interest in research on Large Multi-modal Models (LMMs) for video understanding, many studies have emphasized general video comprehension capabilities while neglecting the systematic exploration of video quality understanding. To address this oversight, we introduce Q-Bench-Video, a new benchmark specifically designed to evaluate LMMs' proficiency in discerning video quality. a) To ensure video source diversity, Q-Bench-Video encompasses videos from natural scenes, AI-Generated Content (AIGC), and Computer Graphics (CG). b) Building on the traditional multiple-choice question format with the Yes-or-No and What-How categories, we include Open-ended questions to better evaluate complex scenarios. Additionally, we incorporate video-pair quality comparison questions to enhance comprehensiveness. c) Beyond the traditional Technical, Aesthetic, and Temporal distortions, we have expanded our evaluation aspects to include the dimension of AIGC distortions, which addresses the increasing demand for video generation. Finally, we collect a total of 2,378 question-answer pairs and test them on 12 open-source and 5 proprietary LMMs. Our findings indicate that while LMMs have a foundational understanding of video quality, their performance remains incomplete and imprecise, with a notable discrepancy compared to human performance. Through Q-Bench-Video, we seek to catalyze community interest, stimulate further research, and unlock the untapped potential of LMMs to close the gap in video quality understanding.