Are Audio-Language Models Listening? Audio-Specialist Heads for Adaptive Audio Steering

📅 2026-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the tendency of existing audio language models to over-rely on textual priors during inference, often neglecting critical acoustic information. By leveraging mechanistic interpretability, the authors identify a subset of attention heads that are particularly sensitive to audio inputs—dubbed “expert” heads—and propose a novel, parameter-free intervention strategy applied at inference time. Specifically, they construct an audio-versus-silence contrastive signal and apply representation steering at the final layer to guide the model toward more effective utilization of audio evidence. This approach significantly enhances the model’s responsiveness to audio inputs, yielding accuracy improvements of up to 8.0 percentage points on the MMAU benchmark for two Qwen-family audio language models, without any parameter updates.

📝 Abstract
Multimodal large language models can exhibit text dominance, over-relying on linguistic priors instead of grounding predictions in non-text inputs. One example is large audio-language models (LALMs), where decisive audio evidence can be under-utilized even when it contains important information. To address this issue, we use mechanistic interpretability to identify a small set of audio-specialist attention heads whose audio attention yields a "listening" signal. We show that this signal increases when audio evidence affects the model's output, providing an indicator of audio engagement under standard prompting. Leveraging this localization, we construct an audio-versus-silence steering direction and apply an inference-time activation intervention to the final representation, amplifying the model's audio effect. To demonstrate the utility of this intervention, we show on MMAU that it improves accuracy by up to +8.0 percentage points on two Qwen-based LALMs, without any parameter updates.
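The steering step described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the mean-difference construction of the audio-versus-silence direction, the unit normalization, and the scale `alpha` are all assumptions about how such an inference-time activation intervention is typically implemented.

```python
import numpy as np

def steering_direction(h_audio: np.ndarray, h_silence: np.ndarray) -> np.ndarray:
    """Contrastive steering direction from paired final-layer hidden states.

    h_audio, h_silence: (n_samples, hidden_dim) representations of the same
    prompts run with real audio vs. silence. Returns a unit vector pointing
    from the "silence" representations toward the "audio" representations.
    """
    d = (h_audio - h_silence).mean(axis=0)
    return d / np.linalg.norm(d)

def steer(h: np.ndarray, direction: np.ndarray, alpha: float = 4.0) -> np.ndarray:
    """Add the scaled direction to a final-layer representation at inference time."""
    return h + alpha * direction
```

In practice this would be applied inside the model (e.g. via a forward hook on the final decoder layer) rather than on detached arrays, but the arithmetic is the same: no parameters are updated, only the activation is shifted along the audio-versus-silence axis.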
Problem

Research questions and friction points this paper is trying to address.

audio-language models
text dominance
multimodal grounding
audio under-utilization
Innovation

Methods, ideas, or system contributions that make the work stand out.

audio-language models
mechanistic interpretability
audio-specialist attention heads
activation intervention
text dominance
Neta Glazer
Bar-Ilan University, Ramat Gan, Israel
Lenny Aharon
Columbia University, New York, NY, USA
Ethan Fetaya
Bar-Ilan University
Machine learning · Computer vision