Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs

📅 2024-09-30

🏛️ arXiv.org

📈 Citations: 2

✨ Influential: 0

🤖 AI Summary

Problem: Large multimodal models (LMMs) lack systematic evaluation of video quality understanding. Method: We introduce Q-Bench-Video, a benchmark dedicated to this task. It covers three video sources (natural scenes, AI-generated content or AIGC, and computer graphics or CG) and four question types (Yes/No, What-How, open-ended, and pairwise quality comparison). Beyond technical, aesthetic, and temporal distortions, it adds AIGC-specific distortions as an evaluation dimension. The evaluation framework combines a multi-granularity QA design with cross-source sampling, and expert annotation yields 2,378 high-quality QA pairs. Contribution/Results: Evaluating 17 LMMs (12 open-source and 5 proprietary) shows that models have a foundational but incomplete and imprecise grasp of video quality, with a substantial gap relative to human performance. Q-Bench-Video provides a reproducible benchmark for this task and points to concrete directions for future improvement.

📝 Abstract

With the rising interest in research on Large Multi-modal Models (LMMs) for video understanding, many studies have emphasized general video comprehension capabilities, neglecting systematic exploration of video quality understanding. To address this oversight, we introduce Q-Bench-Video in this paper, a new benchmark specifically designed to evaluate LMMs' proficiency in discerning video quality. a) To ensure video source diversity, Q-Bench-Video encompasses videos from natural scenes, AI-generated Content (AIGC), and Computer Graphics (CG). b) Building on the traditional multiple-choice question format with the Yes-or-No and What-How categories, we include Open-ended questions to better evaluate complex scenarios. Additionally, we incorporate video-pair quality comparison questions to enhance comprehensiveness. c) Beyond the traditional Technical, Aesthetic, and Temporal distortions, we expand our evaluation aspects to include the dimension of AIGC distortions, which addresses the increasing demand for video generation. Finally, we collect a total of 2,378 question-answer pairs and test them on 12 open-source and 5 proprietary LMMs. Our findings indicate that while LMMs have a foundational understanding of video quality, their performance remains incomplete and imprecise, with a notable discrepancy compared to human performance. Through Q-Bench-Video, we seek to catalyze community interest, stimulate further research, and unlock the untapped potential of LMMs to close the gap in video quality understanding.
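
The benchmark's three axes (video source, question type, distortion dimension) suggest a natural way to slice results. The sketch below is a hypothetical illustration, not the authors' data format or scoring code: the `QAItem` fields, the string labels, and the `accuracy_by` helper are all assumptions. It treats every non-open-ended item as a single-answer choice question scored by exact match and skips open-ended items, whose scoring protocol is not described here.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


# Hypothetical record layout; field names and labels are illustrative only.
@dataclass
class QAItem:
    source: str          # assumed labels: "natural", "AIGC", "CG"
    question_type: str   # assumed labels: "yes_no", "what_how", "open_ended", "pair_comparison"
    dimension: str       # assumed labels: "technical", "aesthetic", "temporal", "aigc"
    answer: str          # ground-truth option label, e.g. "A"
    prediction: Optional[str] = None  # model's chosen option label, if any


def accuracy_by(items, key):
    """Exact-match accuracy over choice-style items, grouped by the field named `key`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        if item.question_type == "open_ended":
            continue  # open-ended answers need a separate scoring protocol
        group = getattr(item, key)
        total[group] += 1
        if item.prediction is not None and (
            item.prediction.strip().upper() == item.answer.strip().upper()
        ):
            correct[group] += 1
    return {g: correct[g] / total[g] for g in total}


# Toy usage with made-up items (not benchmark data).
items = [
    QAItem("natural", "yes_no", "technical", "A", "A"),
    QAItem("AIGC", "what_how", "aigc", "B", "C"),
    QAItem("CG", "what_how", "temporal", "C", "c"),
    QAItem("AIGC", "open_ended", "aesthetic", "", None),
]
print(accuracy_by(items, "source"))  # {'natural': 1.0, 'AIGC': 0.0, 'CG': 1.0}
```

Grouping by a field name keeps one helper usable for all three axes; calling `accuracy_by(items, "dimension")` gives the per-distortion breakdown instead.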

Problem

Evaluates LMMs' ability to understand video quality.

Introduces diverse video sources including AIGC and CG.

Expands evaluation to include AIGC distortions and open-ended questions.

Innovation

Diverse video sources: natural, AIGC, CG.

Open-ended questions for complex scenarios.

Expanded evaluation: AIGC distortions included.