Dynamic Multi-Expert Projectors with Stabilized Routing for Multilingual Speech Recognition

📅 2026-01-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing single-projector approaches struggle to model the diverse acoustic-to-semantic mappings that multilingual speech recognition requires. To address this limitation, this work proposes SMEAR-MoE, a dynamic mixture-of-experts (MoE) projector with a stabilized routing mechanism that serves as the lightweight connection between a frozen speech encoder and a large language model. The stabilized routing keeps gradients flowing densely to all experts, which prevents expert collapse and enables cross-lingual knowledge sharing. Across four Indic languages (Hindi, Marathi, Tamil, Telugu), SMEAR-MoE achieves up to a 7.6% relative word error rate (WER) reduction over the single-projector baseline while maintaining comparable runtime efficiency. Analysis of the routing shows that experts specialize in a linguistically meaningful way, with related languages sharing experts.
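The headline figure is a relative reduction, not an absolute one, and the difference is easy to misread. The short sketch below shows how the two are computed. The function name and the WER values are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent: (baseline - new) / baseline * 100."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer * 100.0


# Hypothetical WER values (percent), NOT taken from the paper.
baseline = 20.0
improved = 18.0

absolute_drop = baseline - improved                          # in WER points
relative_drop = relative_wer_reduction(baseline, improved)   # in percent of baseline

print(f"absolute: {absolute_drop:.1f} points, relative: {relative_drop:.1f}%")
```

With these illustrative inputs the absolute drop is measured in WER points, while the relative drop is that same difference expressed as a fraction of the baseline WER.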

📝 Abstract

Recent advances in LLM-based ASR connect frozen speech encoders with Large Language Models (LLMs) via lightweight projectors. While effective in monolingual settings, a single projector struggles to capture the diverse acoustic-to-semantic mappings required for multilingual ASR. To address this, we propose SMEAR-MoE, a stabilized Mixture-of-Experts projector that ensures dense gradient flow to all experts, preventing expert collapse while enabling cross-lingual sharing. We systematically compare monolithic, static multi-projector, and dynamic MoE designs across four Indic languages (Hindi, Marathi, Tamil, Telugu). Our SMEAR-MoE achieves strong performance, delivering up to a 7.6% relative WER reduction over the single-projector baseline, while maintaining comparable runtime efficiency. Analysis of expert routing further shows linguistically meaningful specialization, with related languages sharing experts. These results demonstrate that stable multi-expert projectors are key to scalable and robust multilingual ASR.
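One way to obtain dense gradient flow to every expert is to merge the experts' parameters with the router's soft weights and apply the single merged projector, rather than selecting a few experts per input. The NumPy sketch below illustrates that idea under stated assumptions: the abstract does not give the exact formulation, so the linear experts, the mean-pooled utterance-level router, and all names and dimensions here are hypothetical and should not be read as the authors' implementation.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def soft_merged_projector(feats, router_w, expert_ws, expert_bs):
    """Project encoder frames with a routing-weighted average of expert parameters.

    feats:     (T, d_in)        frames from a (frozen) speech encoder
    router_w:  (d_in, E)        router weights (assumed linear router)
    expert_ws: (E, d_in, d_out) one linear projector per expert (assumption)
    expert_bs: (E, d_out)       one bias per expert
    """
    pooled = feats.mean(axis=0)                        # utterance-level summary, (d_in,)
    probs = softmax(pooled @ router_w)                 # soft routing weights, (E,)
    merged_w = np.tensordot(probs, expert_ws, axes=1)  # weighted average, (d_in, d_out)
    merged_b = probs @ expert_bs                       # (d_out,)
    # Every expert contributes to merged_w with a nonzero weight, so in a
    # trained model every expert would receive a gradient on each step.
    return feats @ merged_w + merged_b, probs


# Toy dimensions, chosen only for illustration.
rng = np.random.default_rng(0)
T, d_in, d_out, E = 5, 8, 16, 4
feats = rng.normal(size=(T, d_in))
router_w = rng.normal(size=(d_in, E))
expert_ws = rng.normal(size=(E, d_in, d_out))
expert_bs = rng.normal(size=(E, d_out))

out, probs = soft_merged_projector(feats, router_w, expert_ws, expert_bs)
print(out.shape, probs.round(3))
```

Because the merge happens in parameter space, the forward pass applies one projector regardless of the number of experts, which is consistent with the comparable runtime the abstract reports. For purely linear experts, as assumed here, the result equals the routing-weighted sum of the individual expert outputs.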

Problem

Research questions and friction points this paper is trying to address.

multilingual speech recognition

acoustic-to-semantic mapping

projector

expert collapse

cross-lingual sharing

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts

multilingual ASR

stabilized routing