Steer-MoE: Efficient Audio-Language Alignment with a Mixture-of-Experts Steering Module

📅 2025-10-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Full-model fine-tuning for audio-language modality alignment incurs prohibitive computational cost, while static adapters suffer from limited representational capacity. Method: We propose a parameter-efficient approach that freezes both the pretrained audio encoder and the large language model (LLM) and trains only a lightweight, dynamically routed Mixture-of-Experts (MoE) steering module. This module adaptively reweights and transforms audio embeddings in continuous latent space to align them with the LLM's input distribution, without modifying the LLM's architecture or vocabulary. The method combines MoE routing, continuous-space feature alignment, and frozen-backbone training. Contribution/Results: Experiments demonstrate state-of-the-art performance across diverse audio-language tasks, including automatic speech recognition, audio understanding, and function calling, while achieving high efficiency, strong generalization, and a modular design.

📝 Abstract
Aligning pretrained audio encoders and Large Language Models (LLMs) offers a promising, parameter-efficient path to building powerful multimodal agents. However, existing methods often require costly full-model finetuning or rely on static adapters that may lack expressive power. Drawing inspiration from the Platonic Representation Hypothesis, we introduce SteerMoE, a novel and modular framework for audio-language alignment. SteerMoE freezes both the audio encoder and the LLM decoder, training only a lightweight steering module integrated within the encoder's layers. This module uses a Mixture-of-Experts (MoE) router to dynamically select and apply learned steering vectors, progressively transforming continuous audio representations into a space comprehensible to the LLM. By operating entirely in the continuous embedding space, our approach requires no modifications to the LLM's vocabulary and preserves its advanced reasoning and agentic capabilities. We demonstrate through experiments on ASR, audio understanding, and a qualitative function-calling task that SteerMoE achieves strong performance while remaining highly modular and computationally efficient, offering a robust new paradigm for developing sophisticated audio-language systems.
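The core mechanism described in the abstract, a router that selects learned steering vectors and adds them to continuous audio representations, can be illustrated with a minimal sketch. This is not the paper's code: the weight shapes, top-k routing, and residual update are assumptions about how such a steering module is typically built.

```python
import numpy as np

# Minimal sketch of MoE steering (illustrative; not the paper's implementation).
# A router scores each audio frame embedding, the top-k experts' learned
# steering vectors are gated and added to the frame in continuous embedding
# space. The frozen encoder and LLM are untouched; only these weights train.

rng = np.random.default_rng(0)
d, num_experts, top_k = 8, 4, 2

router_w = rng.standard_normal((d, num_experts))       # hypothetical router weights
steering_vecs = rng.standard_normal((num_experts, d))  # learned steering vectors

def steer(frames: np.ndarray) -> np.ndarray:
    """Apply top-k gated steering vectors to each frame (T, d) -> (T, d)."""
    logits = frames @ router_w                         # (T, num_experts)
    # keep only the top-k experts per frame, mask out the rest
    idx = np.argsort(logits, axis=-1)[:, -top_k:]
    masked = np.full_like(logits, -np.inf)
    np.put_along_axis(masked, idx,
                      np.take_along_axis(logits, idx, axis=-1), axis=-1)
    # softmax over the selected experts only
    gates = np.exp(masked - masked.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)
    return frames + gates @ steering_vecs              # residual steering update

audio_frames = rng.standard_normal((5, d))             # 5 dummy encoder frames
steered = steer(audio_frames)
print(steered.shape)                                   # (5, 8)
```

The residual form keeps the steered representation close to the original encoder output, which is consistent with the paper's claim that the LLM's reasoning capabilities are preserved because only the input embedding space is adjusted.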
Problem

Research questions and friction points this paper is trying to address.

Aligning audio encoders with LLMs efficiently
Avoiding costly full-model fine-tuning methods
Preserving LLM capabilities during audio-language integration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Freezes audio encoder and LLM decoder
Trains lightweight MoE steering module
Transforms audio representations for LLM comprehension
Ruitao Feng
Independent Researcher, China
Bixi Zhang
The University of Hong Kong, Faculty of Science, Hong Kong
Sheng Liang
CIS LMU Munich & Munich Center for Machine Learning
NLP
Zheng Yuan
Aix-Marseille University, Laboratoire Parole et Langage (LPL), France