Does Audio Matter for Modern Video-LLMs and Their Benchmarks?

📅 2025-09-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current Video-LLM evaluations severely underestimate the role of audio: mainstream benchmarks are often solvable from single frames, rendering audio contributions negligible. Method: We propose an audio-effectiveness evaluation framework and introduce two challenging audio-sensitive benchmarks, AVQA-Hard and Music-AVQA-Hard, to expose the misalignment between existing benchmarks and real-world audiovisual understanding. Technically, we extend LLaVA-OneVision with Whisper for audio encoding and integrate a Mamba state-space model to compress redundant audio tokens, enabling efficient multimodal alignment and end-to-end audiovisual joint modeling. Results: Audio yields marginal gains on standard benchmarks but proves indispensable on audio-sensitive subsets. Our approach significantly improves inference efficiency while achieving state-of-the-art performance on the new benchmarks. Code, models, and benchmarks are publicly released.

📝 Abstract
Modern multimodal large language models often claim "video understanding," yet most evaluations use muted videos or simply discard audio. We ask a direct question: how much does audio actually matter for contemporary Video-LLMs and the benchmarks that certify them? We audit widely used suites and observe that many items are solvable from a single frame, rendering audio largely redundant. Building on the LLaVA-OneVision architecture, we attach a speech/audio encoder (e.g., Whisper) and analyze when audio helps, while addressing audio token explosion with a lightweight Mamba-based state-space token compressor. We find that audio yields minimal gains on recent video benchmarks but is decisive on curated, audio-sensitive subsets. To enable faithful evaluation, we release AVQA-Hard and Music-AVQA-Hard, our model, and code. Our findings surface a growing gap between current academic practice and real-world expectations, and provide practical tools for scalable audio-visual Video-LLMs. We will fully open-source our work at https://github.com/naver-ai/LLaVA-AV-SSM.
Problem

Research questions and friction points this paper is trying to address.

Assessing audio's importance for Video-LLMs and their evaluation benchmarks
Addressing audio token explosion with lightweight Mamba-based token compressor
Bridging the gap between academic practice and real-world audio-visual expectations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Attaches a Whisper speech/audio encoder for audio processing
Uses a Mamba-based state-space token compressor for efficiency
Creates the AVQA-Hard and Music-AVQA-Hard benchmarks for audio-sensitive evaluation
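To make the token-compression idea concrete: a Whisper-style encoder emits on the order of 1,500 frame embeddings for 30 s of audio, far more than a Video-LLM context comfortably holds. The sketch below is a toy illustration of the general state-space compression principle, not the paper's actual Mamba module: a simple linear recurrence accumulates the frame stream and emits its running state every `rate` steps, shrinking the sequence by that factor. All names, the decay constant, and the 1500×512 shape are illustrative assumptions.

```python
import numpy as np

def ssm_compress(audio_tokens, rate=4, decay=0.9):
    """Toy linear state-space compressor (illustrative, not the paper's Mamba).

    audio_tokens: (T, D) array of per-frame audio embeddings.
    Runs the recurrence h_t = decay * h_{t-1} + x_t over the sequence and
    emits the running state every `rate` steps, returning (T // rate, D).
    """
    T, D = audio_tokens.shape
    h = np.zeros(D)
    out = []
    for t in range(T):
        h = decay * h + audio_tokens[t]      # accumulate context into the state
        if (t + 1) % rate == 0:              # emit a compressed token every `rate` frames
            out.append(h.copy())
    return np.stack(out)

# Whisper-style encoder output: ~1500 frames of 512-dim features for 30 s of audio
tokens = np.random.default_rng(0).standard_normal((1500, 512))
compressed = ssm_compress(tokens, rate=4)
print(compressed.shape)  # (375, 512): 4x fewer audio tokens for the LLM
```

The point of the recurrence (versus naive strided subsampling) is that each emitted token summarizes all frames up to that step, so dropping intermediate positions loses less temporal context; a learned Mamba block plays this role with input-dependent, trainable dynamics.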