🤖 AI Summary
Existing audio-visual foundation models predominantly treat audio as a supplementary modality to vision, overlooking its intrinsic semantic, affective, and event-related information. Method: We introduce ACVUBench, an audio-centric benchmark for video understanding, comprising 2,662 rich-audio videos across 18 categories and over 13,000 human-annotated question-answer pairs. It establishes an audio-centric evaluation paradigm with a multi-dimensional task taxonomy spanning audio-only comprehension and audio-visual joint reasoning. Contribution/Results: Through audio-visual alignment analysis, task-decoupled evaluation, and cross-model comparison, we systematically identify pervasive deficiencies in deep audio semantic modeling and in associating sound sources with events. The benchmark is publicly released.
📝 Abstract
In the video understanding tasks of audio-visual large language models (LLMs), audio often serves only as an auxiliary modality that assists the comprehension of visual information. However, a thorough understanding of videos depends significantly on auditory information, as audio offers critical context, emotional cues, and semantic meaning that visual data alone often lacks. This paper proposes an audio-centric video understanding benchmark (ACVUBench) to evaluate the video comprehension capabilities of multimodal LLMs with a particular focus on auditory information. Specifically, ACVUBench comprises 2,662 videos spanning 18 domains with rich auditory information, together with over 13k high-quality human-annotated or human-validated question-answer pairs. Moreover, ACVUBench introduces a suite of carefully designed audio-centric tasks that holistically test the understanding of both audio content and audio-visual interactions in videos. A thorough evaluation across a diverse range of open-source and proprietary multimodal LLMs is performed, followed by an analysis of the deficiencies of audio-visual LLMs. Demos are available at https://github.com/lark-png/ACVUBench.
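Concretely, evaluating a model on such a benchmark reduces to scoring its answers against the annotated question-answer pairs, both overall and per task. The minimal sketch below illustrates that loop; the JSON field names (`video`, `task`, `question`, `options`, `answer`) and the `answer_question` callable are assumptions made for illustration, not ACVUBench's documented schema — consult the repository above for the actual data format.

```python
import json
from collections import defaultdict


def evaluate(qa_path: str, answer_question) -> dict:
    """Score a model on multiple-choice QA pairs, overall and per task.

    `answer_question(video, question, options)` is a hypothetical callable
    wrapping the model under test; it should return the chosen option letter.
    The JSON layout assumed here (video/task/question/options/answer) is an
    illustrative guess, not necessarily ACVUBench's actual format.
    """
    with open(qa_path) as f:
        qa_pairs = json.load(f)

    correct = defaultdict(int)
    total = defaultdict(int)
    for item in qa_pairs:
        pred = answer_question(item["video"], item["question"], item["options"])
        total[item["task"]] += 1
        if pred.strip().upper() == item["answer"].strip().upper():
            correct[item["task"]] += 1

    # Per-task accuracy exposes exactly the kind of audio-specific
    # deficiency (e.g., audio-only vs. audio-visual tasks) the paper reports.
    per_task = {task: correct[task] / total[task] for task in total}
    overall = sum(correct.values()) / max(sum(total.values()), 1)
    return {"overall": overall, **per_task}
```

Reporting accuracy per task, rather than a single aggregate score, is what makes the audio-only and audio-visual reasoning gaps separable in the analysis.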