MoE Adapter for Large Audio Language Models: Sparsity, Disentanglement, and Gradient-Conflict-Free

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio language models employ dense, shared-parameter adapters to process heterogeneous audio modalities—such as speech, music, and environmental sounds—which often suffer from gradient conflicts that limit performance. To address this, the authors propose MoE-Adapter, the first audio adapter architecture based on a sparse mixture-of-experts (MoE). It dynamically routes audio tokens to specialized experts via a gating mechanism to disentangle acoustic features, while retaining shared experts to preserve global contextual information. Under comparable computational costs, MoE-Adapter significantly outperforms dense baselines across both audio semantic understanding and paralinguistic tasks. The authors release code and models to support further research in this direction.

📝 Abstract
Extending the input modality of Large Language Models (LLMs) to the audio domain is essential for achieving comprehensive multimodal perception. However, it is well-known that acoustic information is intrinsically heterogeneous, entangling attributes such as speech, music, and environmental context. Existing research is limited to a dense, parameter-shared adapter to model these diverse patterns, which induces gradient conflict during optimization, as parameter updates required for distinct attributes contradict each other. To address this limitation, we introduce the MoE-Adapter, a sparse Mixture-of-Experts (MoE) architecture designed to decouple acoustic information. Specifically, it employs a dynamic gating mechanism that routes audio tokens to specialized experts capturing complementary feature subspaces while retaining shared experts for global context, thereby mitigating gradient conflicts and enabling fine-grained feature learning. Comprehensive experiments show that the MoE-Adapter achieves superior performance on both audio semantic and paralinguistic tasks, consistently outperforming dense linear baselines with comparable computational costs. Furthermore, we will release the related code and models to facilitate future research.
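The routing mechanism the abstract describes can be sketched in a few lines. The following is a minimal, illustrative NumPy version (not the authors' released code): each audio token is scored by a gate, dispatched to its top-k specialized experts with renormalized weights, and a shared expert is always added for global context. All dimensions, expert counts, and initializations here are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MoEAdapter:
    """Illustrative sparse MoE adapter: top-k routed experts + a shared expert.

    A sketch of the mechanism described in the abstract, not the
    paper's implementation; all experts are plain linear maps here.
    """

    def __init__(self, d_in, d_out, n_experts=4, top_k=2):
        self.top_k = top_k
        # specialized experts, each intended to capture one feature subspace
        self.experts = [rng.standard_normal((d_in, d_out)) * 0.02
                        for _ in range(n_experts)]
        # shared expert applied to every token (global context)
        self.shared = rng.standard_normal((d_in, d_out)) * 0.02
        # gating network that scores experts per token
        self.gate = rng.standard_normal((d_in, n_experts)) * 0.02

    def __call__(self, tokens):
        # tokens: (T, d_in) audio token embeddings
        logits = tokens @ self.gate                      # (T, n_experts)
        topk = np.argsort(-logits, axis=-1)[:, :self.top_k]
        out = tokens @ self.shared                       # shared-expert path
        for t in range(tokens.shape[0]):
            idx = topk[t]
            weights = softmax(logits[t, idx])            # renormalize over top-k
            for j, w in zip(idx, weights):
                out[t] += w * (tokens[t] @ self.experts[j])
        return out

adapter = MoEAdapter(d_in=8, d_out=16, n_experts=4, top_k=2)
audio_tokens = rng.standard_normal((5, 8))
y = adapter(audio_tokens)
print(y.shape)  # (5, 16)
```

Because only top-k of the expert matrices are applied per token, the per-token compute stays close to a dense linear adapter of the same width, which is the "comparable computational cost" trade-off the abstract claims.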
Problem

Research questions and friction points this paper is trying to address.

audio modality
gradient conflict
heterogeneous acoustic information
parameter-shared adapter
multimodal perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
MoE-Adapter
gradient conflict
audio language models
sparse adaptation
Yishu Lei
ERNIE Team, Baidu
Shuwei He
ERNIE Team, Baidu; College of Computer Science, Inner Mongolia University
Jing Hu
Associate professor, School of Computer Science and Engineering, Xi'an University of Technology
hyperspectral image processing
Dan Zhang
ERNIE Team, Baidu
Xianlong Luo
ERNIE Team, Baidu
Danxiang Zhu
ERNIE Team, Baidu
Shikun Feng
Baidu
NLP
Rui Liu
College of Computer Science, Inner Mongolia University
Jingzhou He
ERNIE Team, Baidu
Yu Sun
Baidu
Natural Language Processing, Deep Learning
Hua Wu
ERNIE Team, Baidu
Haifeng Wang
Baidu
NLP, MT, Search, Speech, Data Mining