Pay More Attention To Audio: Mitigating Imbalance of Cross-Modal Attention in Large Audio Language Models

📅 2025-09-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large Audio-Language Models (LALMs) suffer from audio–text attention imbalance, leading to insufficient utilization of audio cues and constrained audio reasoning performance. To address this, we propose MATA—a training-free, post-hoc intervention method that dynamically enhances model attention to audio tokens by recalibrating self-attention scores in the multimodal fusion layer, without introducing additional parameters or computational overhead. Crucially, MATA modifies only the attention distribution over the final token in intermediate layers, enabling zero-cost, cross-modal attention calibration within the original Transformer architecture. Evaluated on the MMAU and MMAR benchmarks, open-source LALMs enhanced with MATA achieve substantial performance gains—surpassing the proprietary Gemini 2.0 Flash model for the first time. These results validate the effectiveness and generalizability of lightweight attention intervention for multimodal audio–language understanding.
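The summary describes MATA as a training-free recalibration of self-attention scores that boosts audio-token attention for the final query token in intermediate layers. A minimal sketch of that idea is shown below, under assumptions: the boost factor `alpha` and the pre-softmax additive-log formulation are illustrative choices, not the paper's exact rule, and `boost_audio_attention` is a hypothetical helper name.

```python
import numpy as np

def boost_audio_attention(attn_logits, audio_positions, alpha=2.0):
    """Hypothetical MATA-style intervention: scale the raw attention
    logits that the LAST query token assigns to audio-token positions,
    then renormalize with softmax. `alpha` is an assumed boost factor;
    the paper's actual recalibration rule may differ."""
    logits = attn_logits.copy()
    # Intervene only on the last token's row, as the summary describes.
    # Adding log(alpha) pre-softmax multiplies those probabilities by alpha
    # before renormalization.
    logits[-1, audio_positions] += np.log(alpha)
    # Standard softmax over key positions for each query.
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

# Toy example: 4 tokens, positions 0-1 are audio tokens.
attn = np.zeros((4, 4))            # uniform raw scores
probs = boost_audio_attention(attn, audio_positions=[0, 1], alpha=2.0)
```

Because the change touches only one row of the attention matrix and adds no parameters, it matches the "zero-cost, post-hoc" framing of the summary.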

📝 Abstract
Large Audio-Language Models (LALMs) often suffer from audio-textual attention imbalance, prioritizing text over acoustic information, particularly in the multi-modal fusion layers of the Transformer architecture. This bias hinders their ability to fully utilize acoustic cues, causing suboptimal performance on audio reasoning tasks. To mitigate this, we propose MATA, a novel training-free method that dynamically pushes LALMs to pay More Attention To Audio tokens within the self-attention mechanism. Specifically, MATA intervenes after the raw attention scores are computed, targeting only the last token in intermediate layers, without introducing additional parameters or computational overhead. Experiments on the MMAU and MMAR benchmarks confirm MATA's effectiveness, with consistent performance gains. Notably, on MMAR, MATA enables an open-source model to surpass the proprietary Gemini 2.0 Flash for the first time. Our work provides an efficient solution to mitigate attention bias and opens a new research direction for enhancing the audio-processing capabilities of multi-modal models.
Problem

Research questions and friction points this paper is trying to address.

Large Audio-Language Models prioritize text over acoustic information
Audio-textual attention imbalance hinders utilization of acoustic cues
Attention bias causes suboptimal performance on audio reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic adjustment of audio-text attention scores
Training-free intervention in self-attention mechanism
Parameter-free enhancement of audio token weighting
Junyu Wang
Laboratory of Cognitive Computing and Application, College of Intelligence and Computing, Tianjin University, Tianjin, China
Ziyang Ma
School of Computer Science, Shanghai Jiao Tong University, Shanghai, China
Zhengding Luo
School of Electrical & Electronic Engineering, Nanyang Technological University, Singapore
Tianrui Wang
Tianjin University
Speech Signal Processing
Meng Ge
Tianjin University; CUHK-Shenzhen; National University of Singapore
Xiaobao Wang
Associate Professor, Tianjin University
Artificial Intelligence; Safety of Large-Model Generation; Graph Machine Learning
Longbiao Wang
Professor, Tianjin University
Speech Processing; Speech Recognition; Speaker Recognition; Acoustic Signal Processing; Speech Enhancement