MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders

📅 2024-09-10
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing AudioLLMs rely on a single pre-trained audio encoder, whose fixed representational capacity limits generalization across diverse audio tasks. To address this, the paper proposes the Mixture of Weak Encoders (MoWE), which supplements the monolithic encoder with multiple lightweight audio encoders and employs a learnable gating mechanism to dynamically activate a task-adaptive subset of them, while keeping the large language model frozen. This enables task-specific feature enhancement with no architectural changes to the LLM and only a small parametric cost. MoWE introduces a "mixture of weak encoders" paradigm that eases the representational bottleneck while balancing expressiveness and efficiency: it adds less than 3% extra parameters while substantially increasing representational diversity. Evaluated on a cross-domain multi-task audio understanding benchmark, MoWE achieves an average accuracy gain of +4.2% over strong single-encoder baselines, demonstrating improved adaptability and robustness.

📝 Abstract
The rapid advancements in large language models (LLMs) have significantly enhanced natural language processing capabilities, facilitating the development of AudioLLMs that process and understand speech and audio inputs alongside text. Existing AudioLLMs typically combine a pre-trained audio encoder with a pre-trained LLM, which are subsequently finetuned on specific audio tasks. However, the pre-trained audio encoder has constrained capacity to capture features for new tasks and datasets. To address this, we propose to incorporate mixtures of `weak' encoders (MoWE) into the AudioLLM framework. MoWE supplements a base encoder with a pool of relatively lightweight encoders, selectively activated based on the audio input to enhance feature extraction without significantly increasing model size. Our empirical results demonstrate that MoWE effectively improves multi-task performance, broadening the applicability of AudioLLMs to more diverse audio tasks.
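The mechanism the abstract describes (a base encoder supplemented by a gated pool of lightweight encoders) can be sketched roughly as follows. This is an illustrative NumPy mock-up under assumed names and shapes, not the paper's implementation: each "weak encoder" is stood in for by a small linear map, and the gate activates a top-k subset per input.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class MoWEPool:
    """Hypothetical sketch of a mixture-of-weak-encoders layer.

    Base-encoder features are supplemented by a pool of lightweight
    ("weak") encoders; a learned gate scores the pool from the audio
    features and only the top-k weak encoders are activated per input.
    Names, shapes, and the linear stand-ins are illustrative only.
    """

    def __init__(self, dim=8, n_weak=4, top_k=2):
        self.top_k = top_k
        # Stand-in weak encoders: one small linear map each.
        self.weak = [rng.standard_normal((dim, dim)) * 0.1 for _ in range(n_weak)]
        # Gate parameters: map pooled features to one score per weak encoder.
        self.gate_w = rng.standard_normal((dim, n_weak)) * 0.1

    def __call__(self, base_feat):
        # Score the pool from the mean-pooled base features.
        logits = base_feat.mean(axis=0) @ self.gate_w
        weights = softmax(logits)
        # Keep only the top-k weak encoders; renormalize their weights.
        top = np.argsort(weights)[-self.top_k:]
        w = weights[top] / weights[top].sum()
        # Add the weighted outputs of the activated encoders to the base.
        enhanced = base_feat + sum(
            wi * (base_feat @ self.weak[i]) for wi, i in zip(w, top)
        )
        return enhanced, top

pool = MoWEPool()
feats = rng.standard_normal((5, 8))   # 5 audio frames, dim-8 base features
out, active = pool(feats)
print(out.shape, sorted(active.tolist()))  # (5, 8) and the two activated expert indices
```

Because only k of the n weak encoders run per input, the added compute and parameters stay small relative to the frozen LLM, which is the efficiency argument the abstract makes.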
Problem

Research questions and friction points this paper is trying to address.

Enhance feature extraction in AudioLLMs.
Improve multi-task performance with MoWE.
Broaden applicability to diverse audio tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture of Weak Encoders
Enhance feature extraction
Improve multi-task performance
Wenyu Zhang
Institute for Infocomm Research (I2R), Agency for Science, Technology and Research (A*STAR)
Shuo Sun
Johns Hopkins University
Bin Wang
Institute for Infocomm Research (I2R), Agency for Science, Technology and Research (A*STAR)
Xunlong Zou
Institute for Infocomm Research (I2R), Agency for Science, Technology and Research (A*STAR)
Zhuohan Liu
Research Engineer
Yingxu He
Institute for Infocomm Research (I2R), Agency for Science, Technology and Research (A*STAR)
Geyu Lin
Research Engineer, I2R, A*STAR
Generative AI · NLP · Speech
Nancy F. Chen
ISCA Fellow, AAIA Fellow, Multimodal Generative AI Group Leader, AI for Education Head at A*STAR
Agentic AI · Large Language Models · Conversational AI
AiTi Aw