Enhancing Speech Large Language Models with Prompt-Aware Mixture of Audio Encoders

📅 2025-02-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Single-audio-encoder architectures in speech large language models (LLMs) struggle to serve semantic understanding tasks (e.g., automatic speech recognition, audio captioning) and acoustic modeling tasks (e.g., speaker number verification) at the same time. To address this, the paper proposes Prompt-aware Mixture (PaM), which uses a prompt-driven gating mechanism to weight multiple audio encoders according to the task the prompt indicates, replacing naive concatenation or averaging with a task-aware fusion of the heterogeneous encoder outputs. With PaM, a single unified speech LLM surpasses the best results of all single-encoder baselines and of conventional feature-fusion methods across ASR, speaker number verification, and audio captioning, showing that one architecture can perform well across these heterogeneous speech tasks despite the constraints of the single-encoder paradigm.
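The prompt-driven gating described in the summary can be sketched in toy form as a softmax over prompt-conditioned logits that mixes the outputs of several encoders. Everything below (encoder names, dimensions, the single-layer gate) is an illustrative assumption, not the paper's actual implementation:

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D array
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
dim = 8  # toy feature dimension

# Stand-ins for the outputs of two audio encoders: one tuned to
# semantic content, one to acoustic properties (hypothetical).
semantic_feat = rng.normal(size=dim)
acoustic_feat = rng.normal(size=dim)
encoder_feats = np.stack([semantic_feat, acoustic_feat])  # (n_encoders, dim)

# Stand-in for an embedded task prompt (e.g., "Transcribe the speech").
prompt_emb = rng.normal(size=dim)

# Hypothetical one-layer gate: project the prompt to one logit per encoder.
W_gate = rng.normal(size=(2, dim))
weights = softmax(W_gate @ prompt_emb)  # prompt-aware mixing weights, sum to 1

# Task-aware fusion: a weighted sum of encoder features,
# instead of plain concatenation or averaging.
fused = weights @ encoder_feats  # shape (dim,)
print(weights, fused.shape)
```

A different prompt embedding yields different mixing weights, so the same model can emphasize semantic or acoustic features depending on the requested task.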

📝 Abstract
Connecting audio encoders with large language models (LLMs) allows the LLM to perform various audio understanding tasks, such as automatic speech recognition (ASR) and audio captioning (AC). Most research focuses on training an adapter layer to generate a unified audio feature for the LLM. However, different tasks may require distinct features that emphasize either semantic or acoustic aspects, making task-specific audio features more desirable. In this paper, we propose Prompt-aware Mixture (PaM) to enhance the Speech LLM that uses multiple audio encoders. Our approach involves using different experts to extract different features based on the prompt that indicates different tasks. Experiments demonstrate that with PaM, only one Speech LLM surpasses the best performances achieved by all single-encoder Speech LLMs on ASR, Speaker Number Verification, and AC tasks. PaM also outperforms other feature fusion baselines, such as concatenation and averaging.
Problem

Research questions and friction points this paper is trying to address.

Integrating multiple audio encoders with LLMs.
Extracting task-specific rather than unified audio features.
Improving performance across diverse audio understanding tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prompt-aware Mixture (PaM) enhances the Speech LLM
Multiple audio encoders extract task-specific features
PaM surpasses the best single-encoder performance on every task
Weiqiao Shan
School of Computer Science and Engineering, Northeastern University, Shenyang, China
Yuang Li
2012 Lab, Huawei
Speech · NLP
Yuhao Zhang
The Chinese University of Hong Kong, Shenzhen, China
Yingfeng Luo
School of Computer Science and Engineering, Northeastern University, Shenyang, China
Chen Xu
College of Computer Science and Technology, Harbin Engineering University, Harbin, China
Xiaofeng Zhao
Huawei Translation Services Center, Beijing, China
Long Meng
School of Computer Science and Engineering, Northeastern University, Shenyang, China
Yunfei Lu
Huawei
Large Language Model · Machine Translation · Data Mining
Min Zhang
Huawei Translation Services Center, Beijing, China
Hao Yang
Huawei Translation Services Center, Beijing, China
Tong Xiao
School of Computer Science and Engineering, Northeastern University, Shenyang, China; NiuTrans Research, Shenyang, China
Jingbo Zhu
Northeastern University, China
Machine Translation · Language Parsing · Natural Language Processing