All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation

📅 2026-04-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the concern that high scores of current audio-language models on standard benchmarks may stem from textual priors rather than genuine use of the acoustic signal. The authors propose a diagnostic framework with two axes: text prior, which measures how answerable a question is from text and general knowledge alone, and audio reliance, which measures how much a model actually depends on the audio input. Using no-audio ablations and segment-level audio analysis, they evaluate eight large audio-language models across three benchmarks. Models retain 60–72% of their full-audio scores even with no audio input, and among the items that do require audio, only 3.0–4.2% need the complete clip; most can be solved from brief local segments. These results challenge the assumption that benchmark performance reflects robust audio understanding and indicate that existing benchmarks do not adequately assess auditory reasoning.

📝 Abstract

Large Audio-Language Models show consistent performance gains across speech and audio benchmarks, yet high scores may not reflect true auditory perception. If a model can answer questions without processing the acoustic signal, the benchmark fails as a measure of auditory understanding. We present a diagnostic framework using two axes: text prior, which measures answerability from text and general knowledge alone, and audio reliance, which assesses actual dependency on the acoustic signal. Evaluating eight LALMs across three benchmarks, we find that models retain 60-72% of their full audio scores even without any audio input. Moreover, among items that require audio, only 3.0-4.2% need the complete audio clip; the majority can be resolved using localized fragments. These findings challenge the assumption that benchmark performance equals robust audio understanding, and we conclude with practical guidelines for improving evaluation reliability and benchmark design.
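
The two headline quantities in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's exact metric definitions, so everything here is an assumption: the `ItemResult` record, the function names, and the rule that an item "requires audio" when the model is correct with the full clip but wrong with no audio are hypothetical choices made for illustration only.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ItemResult:
    """Per-question outcomes under three hypothetical input conditions."""
    correct_full_audio: bool  # model was given the complete clip
    correct_no_audio: bool    # model was given only the question text
    correct_fragment: bool    # model was given only a short local segment


def accuracy(flags: Sequence[bool]) -> float:
    """Fraction of True values; 0.0 for an empty sequence."""
    return sum(flags) / len(flags) if flags else 0.0


def retention_without_audio(items: Sequence[ItemResult]) -> float:
    """No-audio accuracy as a fraction of full-audio accuracy (assumed definition)."""
    full = accuracy([i.correct_full_audio for i in items])
    none = accuracy([i.correct_no_audio for i in items])
    return none / full if full > 0 else 0.0


def full_clip_share(items: Sequence[ItemResult]) -> float:
    """Among items assumed to require audio, the share not solvable from a fragment."""
    # Assumption: an item "requires audio" if it is answered correctly
    # with the full clip but incorrectly with no audio at all.
    needs_audio = [
        i for i in items if i.correct_full_audio and not i.correct_no_audio
    ]
    if not needs_audio:
        return 0.0
    needs_full = [i for i in needs_audio if not i.correct_fragment]
    return len(needs_full) / len(needs_audio)
```

Under these assumed definitions, a retention value near 0.6 to 0.7 would correspond to the 60-72% range the abstract reports, and `full_clip_share` would correspond to the 3.0-4.2% figure. Whether the authors compute either quantity this way is not stated here.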

Problem

Research questions and friction points this paper is trying to address.

audio-language models

text priors

audio reliance

benchmark evaluation

auditory understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

audio-language models

text priors

audio reliance