🤖 AI Summary
Large Audio-Language Models (LALMs) suffer from audio–text attention imbalance, leading to insufficient utilization of audio cues and constrained audio reasoning performance. To address this, we propose MATA—a training-free, post-hoc intervention method that dynamically enhances model attention to audio tokens by recalibrating self-attention scores in the multimodal fusion layer, without introducing additional parameters or computational overhead. Crucially, MATA modifies only the attention distribution over the final token in intermediate layers, enabling zero-cost, cross-modal attention calibration within the original Transformer architecture. Evaluated on the MMAU and MMAR benchmarks, open-source LALMs enhanced with MATA achieve substantial performance gains—surpassing the proprietary Gemini 2.0 Flash model for the first time. These results validate the effectiveness and generalizability of lightweight attention intervention for multimodal audio–language understanding.
📝 Abstract
Large Audio-Language Models (LALMs) often suffer from audio–text attention imbalance, prioritizing textual over acoustic information, particularly in the multi-modal fusion layers of the Transformer architecture. This bias hinders their ability to fully utilize acoustic cues, causing suboptimal performance on audio reasoning tasks. To mitigate this, we propose **MATA**, a novel training-free method that dynamically pushes LALMs to pay **M**ore **A**ttention **T**o **A**udio tokens within the self-attention mechanism. Specifically, MATA intervenes after the raw attention scores are computed, targeting only the last token in intermediate layers, and introduces no additional parameters or computational overhead. Experiments on the MMAU and MMAR benchmarks confirm MATA's effectiveness, with consistent performance gains. Notably, on MMAR, MATA enables an open-source model to surpass the proprietary Gemini 2.0 Flash for the first time. Our work provides an efficient solution to mitigate attention bias and opens a new research direction for enhancing the audio-processing capabilities of multi-modal models.
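The core idea, intervening on the attention distribution of the last query token so that audio positions receive more mass, can be sketched as follows. This is an illustrative toy example, not the paper's implementation: the multiplicative-then-renormalize rescaling rule, the function name `boost_audio_attention`, and the boost factor `alpha` are assumptions introduced here for clarity.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array of logits.
    e = np.exp(x - x.max())
    return e / e.sum()

def boost_audio_attention(logits, audio_mask, alpha=2.0):
    """Hypothetical sketch of a MATA-style intervention.

    logits:     attention logits of the last query token over all
                key positions (shape: [seq_len]).
    audio_mask: boolean array marking which positions are audio tokens.
    alpha:      assumed boost factor (> 1 increases audio attention).
    """
    # Ordinary attention distribution for the last token.
    probs = softmax(logits)
    # Up-weight audio positions, then renormalize so the row
    # remains a valid probability distribution.
    probs = probs.copy()
    probs[audio_mask] *= alpha
    return probs / probs.sum()

# Toy usage: two audio tokens followed by two text tokens,
# all with equal logits.
logits = np.zeros(4)
audio_mask = np.array([True, True, False, False])
probs = boost_audio_attention(logits, audio_mask, alpha=2.0)
```

Because the operation only rewrites one row of the attention matrix per intermediate layer, it adds no parameters and negligible compute, which is what makes the intervention training-free and post-hoc.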