MoE Adapter for Large Audio Language Models: Sparsity, Disentanglement, and Gradient-Conflict-Free

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio language models employ dense, shared-parameter adapters to process heterogeneous audio modalities—such as speech, music, and environmental sounds—which often suffer from gradient conflicts that limit performance. To address this, the authors propose MoE-Adapter, the first audio adapter architecture based on a sparse mixture-of-experts (MoE). It dynamically routes audio tokens to specialized experts via a gating mechanism to disentangle acoustic features, while retaining shared experts to preserve global contextual information. Under comparable computational costs, MoE-Adapter significantly outperforms dense baselines across both audio semantic understanding and paralinguistic tasks. The authors release code and models to support further research in this direction.

📝 Abstract
Extending the input modality of Large Language Models (LLMs) to the audio domain is essential for achieving comprehensive multimodal perception. However, it is well-known that acoustic information is intrinsically heterogeneous, entangling attributes such as speech, music, and environmental context. Existing research is limited to a dense, parameter-shared adapter to model these diverse patterns, which induces gradient conflict during optimization, as parameter updates required for distinct attributes contradict each other. To address this limitation, we introduce the MoE-Adapter, a sparse Mixture-of-Experts (MoE) architecture designed to decouple acoustic information. Specifically, it employs a dynamic gating mechanism that routes audio tokens to specialized experts capturing complementary feature subspaces while retaining shared experts for global context, thereby mitigating gradient conflicts and enabling fine-grained feature learning. Comprehensive experiments show that the MoE-Adapter achieves superior performance on both audio semantic and paralinguistic tasks, consistently outperforming dense linear baselines with comparable computational costs. Furthermore, we will release the related code and models to facilitate future research.
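The routing mechanism the abstract describes can be sketched in a few lines. The following is a minimal, illustrative NumPy version (not the authors' released code): each audio token is scored by a gate, dispatched to its top-k specialized experts with renormalized weights, and a shared expert is always added for global context. All dimensions, expert counts, and initializations here are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MoEAdapter:
    """Illustrative sparse MoE adapter: top-k routed experts + a shared expert.

    A sketch of the mechanism described in the abstract, not the
    paper's implementation; all experts are plain linear maps here.
    """

    def __init__(self, d_in, d_out, n_experts=4, top_k=2):
        self.top_k = top_k
        # specialized experts, each intended to capture one feature subspace
        self.experts = [rng.standard_normal((d_in, d_out)) * 0.02
                        for _ in range(n_experts)]
        # shared expert applied to every token (global context)
        self.shared = rng.standard_normal((d_in, d_out)) * 0.02
        # gating network that scores experts per token
        self.gate = rng.standard_normal((d_in, n_experts)) * 0.02

    def __call__(self, tokens):
        # tokens: (T, d_in) audio token embeddings
        logits = tokens @ self.gate                      # (T, n_experts)
        topk = np.argsort(-logits, axis=-1)[:, :self.top_k]
        out = tokens @ self.shared                       # shared-expert path
        for t in range(tokens.shape[0]):
            idx = topk[t]
            weights = softmax(logits[t, idx])            # renormalize over top-k
            for j, w in zip(idx, weights):
                out[t] += w * (tokens[t] @ self.experts[j])
        return out

adapter = MoEAdapter(d_in=8, d_out=16, n_experts=4, top_k=2)
audio_tokens = rng.standard_normal((5, 8))
y = adapter(audio_tokens)
print(y.shape)  # (5, 16)
```

Because only top-k of the expert matrices are applied per token, the per-token compute stays close to a dense linear adapter of the same width, which is the "comparable computational cost" trade-off the abstract claims.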
Problem

Research questions and friction points this paper is trying to address.

audio modality
gradient conflict
heterogeneous acoustic information
parameter-shared adapter
multimodal perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
MoE-Adapter
gradient conflict
audio language models
sparse adaptation
Yishu Lei
ERNIE Team, Baidu
Shuwei He
ERNIE Team, Baidu; College of Computer Science, Inner Mongolia University
Jing Hu
Associate professor, School of Computer Science and Engineering, Xi'an University of Technology
hyperspectral image processing
Dan Zhang
ERNIE Team, Baidu
Xianlong Luo
ERNIE Team, Baidu
Danxiang Zhu
ERNIE Team, Baidu
Shikun Feng
Baidu
NLP
Rui Liu
College of Computer Science, Inner Mongolia University
Jingzhou He
ERNIE Team, Baidu
Yu Sun
Baidu
Natural Language Processing, Deep Learning
Hua Wu
ERNIE Team, Baidu
Haifeng Wang
Baidu
NLP, MT, Search, Speech, Data Mining