🤖 AI Summary
Large Audio-Language Models (LALMs) suffer from audio–text attention imbalance, leading to insufficient utilization of audio cues and constrained audio reasoning performance. To address this, we propose MATA—a training-free, post-hoc intervention method that dynamically enhances model attention to audio tokens by recalibrating self-attention scores in the multimodal fusion layer, without introducing additional parameters or computational overhead. Crucially, MATA modifies only the attention distribution over the final token in intermediate layers, enabling zero-cost, cross-modal attention calibration within the original Transformer architecture. Evaluated on the MMAU and MMAR benchmarks, open-source LALMs enhanced with MATA achieve substantial performance gains—surpassing the proprietary Gemini 2.0 Flash model for the first time. These results validate the effectiveness and generalizability of lightweight attention intervention for multimodal audio–language understanding.
📝 Abstract
Large Audio-Language Models (LALMs) often suffer from audio–text attention imbalance, prioritizing textual over acoustic information, particularly in the multi-modal fusion layers of the Transformer architecture. This bias hinders their ability to fully utilize acoustic cues, causing suboptimal performance on audio reasoning tasks. To mitigate this, we propose **MATA**, a novel training-free method that dynamically pushes LALMs to pay **M**ore **A**ttention **T**o **A**udio tokens within the self-attention mechanism. Specifically, MATA intervenes after the raw attention scores are computed, targeting only the last token in intermediate layers, and introduces no additional parameters or computational overhead. Experiments on the MMAU and MMAR benchmarks confirm MATA's effectiveness, with consistent performance gains. Notably, on MMAR, MATA enables an open-source model to surpass the proprietary Gemini 2.0 Flash for the first time. Our work provides an efficient solution to mitigate attention bias and opens a new research direction for enhancing the audio-processing capabilities of multi-modal models.
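The core idea, intervening on the attention distribution of the last query token so that audio positions receive more mass, can be sketched as follows. This is an illustrative toy example, not the paper's implementation: the multiplicative-then-renormalize rescaling rule, the function name `boost_audio_attention`, and the boost factor `alpha` are assumptions introduced here for clarity.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array of logits.
    e = np.exp(x - x.max())
    return e / e.sum()

def boost_audio_attention(logits, audio_mask, alpha=2.0):
    """Hypothetical sketch of a MATA-style intervention.

    logits:     attention logits of the last query token over all
                key positions (shape: [seq_len]).
    audio_mask: boolean array marking which positions are audio tokens.
    alpha:      assumed boost factor (> 1 increases audio attention).
    """
    # Ordinary attention distribution for the last token.
    probs = softmax(logits)
    # Up-weight audio positions, then renormalize so the row
    # remains a valid probability distribution.
    probs = probs.copy()
    probs[audio_mask] *= alpha
    return probs / probs.sum()

# Toy usage: two audio tokens followed by two text tokens,
# all with equal logits.
logits = np.zeros(4)
audio_mask = np.array([True, True, False, False])
probs = boost_audio_attention(logits, audio_mask, alpha=2.0)
```

Because the operation only rewrites one row of the attention matrix per intermediate layer, it adds no parameters and negligible compute, which is what makes the intervention training-free and post-hoc.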