MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders

📅 2024-09-10
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing AudioLLMs rely on a single pre-trained audio encoder, whose fixed representational capacity limits generalization across diverse audio tasks. To address this, the paper proposes the Mixture of Weak Encoders (MoWE), which supplements the monolithic encoder with multiple lightweight audio encoders and employs a learnable gating mechanism to dynamically activate a task-adaptive subset of them, while keeping the large language model frozen. This enables task-specific feature enhancement with no architectural changes to the LLM and only a small parametric cost. MoWE introduces a "mixture of weak encoders" paradigm that eases the representational bottleneck while balancing expressiveness and efficiency: it adds less than 3% extra parameters while substantially increasing representational diversity. Evaluated on a cross-domain multi-task audio understanding benchmark, MoWE achieves an average accuracy gain of +4.2% over strong single-encoder baselines, demonstrating improved adaptability and robustness.

📝 Abstract
The rapid advancements in large language models (LLMs) have significantly enhanced natural language processing capabilities, facilitating the development of AudioLLMs that process and understand speech and audio inputs alongside text. Existing AudioLLMs typically combine a pre-trained audio encoder with a pre-trained LLM, which are subsequently finetuned on specific audio tasks. However, the pre-trained audio encoder has constrained capacity to capture features for new tasks and datasets. To address this, we propose to incorporate mixtures of `weak' encoders (MoWE) into the AudioLLM framework. MoWE supplements a base encoder with a pool of relatively lightweight encoders, selectively activated based on the audio input to enhance feature extraction without significantly increasing model size. Our empirical results demonstrate that MoWE effectively improves multi-task performance, broadening the applicability of AudioLLMs to more diverse audio tasks.
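The mechanism the abstract describes (a base encoder supplemented by a gated pool of lightweight encoders) can be sketched roughly as follows. This is an illustrative NumPy mock-up under assumed names and shapes, not the paper's implementation: each "weak encoder" is stood in for by a small linear map, and the gate activates a top-k subset per input.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class MoWEPool:
    """Hypothetical sketch of a mixture-of-weak-encoders layer.

    Base-encoder features are supplemented by a pool of lightweight
    ("weak") encoders; a learned gate scores the pool from the audio
    features and only the top-k weak encoders are activated per input.
    Names, shapes, and the linear stand-ins are illustrative only.
    """

    def __init__(self, dim=8, n_weak=4, top_k=2):
        self.top_k = top_k
        # Stand-in weak encoders: one small linear map each.
        self.weak = [rng.standard_normal((dim, dim)) * 0.1 for _ in range(n_weak)]
        # Gate parameters: map pooled features to one score per weak encoder.
        self.gate_w = rng.standard_normal((dim, n_weak)) * 0.1

    def __call__(self, base_feat):
        # Score the pool from the mean-pooled base features.
        logits = base_feat.mean(axis=0) @ self.gate_w
        weights = softmax(logits)
        # Keep only the top-k weak encoders; renormalize their weights.
        top = np.argsort(weights)[-self.top_k:]
        w = weights[top] / weights[top].sum()
        # Add the weighted outputs of the activated encoders to the base.
        enhanced = base_feat + sum(
            wi * (base_feat @ self.weak[i]) for wi, i in zip(w, top)
        )
        return enhanced, top

pool = MoWEPool()
feats = rng.standard_normal((5, 8))   # 5 audio frames, dim-8 base features
out, active = pool(feats)
print(out.shape, sorted(active.tolist()))  # (5, 8) and the two activated expert indices
```

Because only k of the n weak encoders run per input, the added compute and parameters stay small relative to the frozen LLM, which is the efficiency argument the abstract makes.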
Problem

Research questions and friction points this paper is trying to address.

Enhance feature extraction in AudioLLMs.
Improve multi-task performance with MoWE.
Broaden applicability to diverse audio tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture of Weak Encoders
Enhance feature extraction
Improve multi-task performance
Wenyu Zhang
Institute for Infocomm Research (I2R), Agency for Science, Technology and Research (A*STAR)
Shuo Sun
Johns Hopkins University
Bin Wang
Institute for Infocomm Research (I2R), Agency for Science, Technology and Research (A*STAR)
Xunlong Zou
Institute for Infocomm Research (I2R), Agency for Science, Technology and Research (A*STAR)
Zhuohan Liu
Research Engineer
Yingxu He
Institute for Infocomm Research (I2R), Agency for Science, Technology and Research (A*STAR)
Geyu Lin
Research Engineer, I2R, A*STAR
Generative AI · NLP · Speech
Nancy F. Chen
ISCA Fellow, AAIA Fellow, Multimodal Generative AI Group Leader, AI for Education Head at A*STAR
Agentic AI · Large Language Models · Conversational AI
AiTi Aw