ACVUBench: Audio-Centric Video Understanding Benchmark

📅 2025-03-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing audio-visual foundation models predominantly treat audio as a supplementary modality to vision, overlooking its intrinsic semantic, affective, and event-related information. Method: The paper introduces ACVUBench, an audio-centric benchmark for video understanding, comprising 2,662 videos with rich auditory information across 18 domains and over 13k human-annotated or human-validated question-answer pairs. It establishes an audio-centric evaluation paradigm whose tasks span both audio-only comprehension and audio-visual joint reasoning. Contribution/Results: Through evaluation of a diverse range of open-source and proprietary multimodal LLMs, the paper identifies deficiencies in modeling deep audio semantics and in associating audio sources with events. Demos are publicly available.

📝 Abstract

Audio often serves as an auxiliary modality in video understanding tasks of audio-visual large language models (LLMs), merely assisting in the comprehension of visual information. However, a thorough understanding of videos significantly depends on auditory information, as audio offers critical context, emotional cues, and semantic meaning that visual data alone often lacks. This paper proposes an audio-centric video understanding benchmark (ACVUBench) to evaluate the video comprehension capabilities of multimodal LLMs with a particular focus on auditory information. Specifically, ACVUBench incorporates 2,662 videos spanning 18 different domains with rich auditory information, together with over 13k high-quality human-annotated or human-validated question-answer pairs. Moreover, ACVUBench introduces a suite of carefully designed audio-centric tasks, holistically testing the understanding of both audio content and audio-visual interactions in videos. A thorough evaluation across a diverse range of open-source and proprietary multimodal LLMs is performed, followed by an analysis of the deficiencies of audio-visual LLMs. Demos are available at https://github.com/lark-png/ACVUBench.
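
As a rough illustration of how a benchmark of question-answer pairs grouped into tasks might be scored, the sketch below computes accuracy separately for each task. It is not the authors' evaluation code: the record keys (`task`, `answer`, `prediction`), the task names, and the case-insensitive exact-match rule are all assumptions made for the example.

```python
from collections import defaultdict


def per_task_accuracy(records):
    """Return {task: accuracy} for an iterable of result records.

    Each record is a dict with the hypothetical keys 'task', 'answer',
    and 'prediction'. A prediction counts as correct when it matches
    the reference answer after trimming whitespace and ignoring case.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        task = record["task"]
        total[task] += 1
        if record["prediction"].strip().lower() == record["answer"].strip().lower():
            correct[task] += 1
    return {task: correct[task] / total[task] for task in total}


# Toy records with made-up task names, only to show the call.
records = [
    {"task": "audio_only", "answer": "B", "prediction": "b "},
    {"task": "audio_only", "answer": "A", "prediction": "C"},
    {"task": "audio_visual", "answer": "D", "prediction": "D"},
]
scores = per_task_accuracy(records)
```

Reporting one score per task rather than a single pooled number keeps weaknesses in audio-only questions from being hidden by stronger results on audio-visual ones, or the reverse.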

Problem

Research questions and friction points this paper is trying to address.

Evaluates multimodal LLMs' video comprehension with audio focus

Assesses understanding of audio content and audio-visual interactions

Identifies deficiencies in audio-visual LLMs across diverse domains

Innovation

Methods, ideas, or system contributions that make the work stand out.

Audio-centric video understanding benchmark

2,662 videos with rich auditory information

Tests audio content and audio-visual interactions
