🤖 AI Summary
In LLM-based speech recognition, a single projector struggles to model the diverse acoustic-to-semantic mappings that multilingual input requires. This work proposes SMEAR-MoE, a dynamic Mixture-of-Experts (MoE) projector with stabilized routing that serves as the lightweight connection between a frozen speech encoder and a large language model. Its routing keeps gradient flow dense across all experts, which prevents expert collapse and allows cross-lingual knowledge sharing. On four Indic languages (Hindi, Marathi, Tamil, Telugu), SMEAR-MoE achieves up to a 7.6% relative word error rate (WER) reduction over the single-projector baseline while maintaining comparable runtime efficiency. Analysis of the expert routing shows linguistically meaningful specialization, with related languages sharing experts.
📝 Abstract
Recent advances in LLM-based automatic speech recognition (ASR) connect frozen speech encoders with large language models (LLMs) via lightweight projectors. While effective in monolingual settings, a single projector struggles to capture the diverse acoustic-to-semantic mappings required for multilingual ASR. To address this, we propose SMEAR-MoE, a stabilized Mixture-of-Experts projector that ensures dense gradient flow to all experts, preventing expert collapse while enabling cross-lingual sharing. We systematically compare monolithic, static multi-projector, and dynamic MoE designs across four Indic languages (Hindi, Marathi, Tamil, Telugu). SMEAR-MoE delivers up to a 7.6% relative WER reduction over the single-projector baseline while maintaining comparable runtime efficiency. Analysis of expert routing further shows linguistically meaningful specialization, with related languages sharing experts. These results demonstrate that stable multi-expert projectors are key to scalable and robust multilingual ASR.
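The abstract does not spell out how dense gradient flow is obtained. One common way to get it, sketched below, is to soft-merge the experts: the router's probabilities are used to average the expert parameters into a single merged projector, so every expert contributes to every forward pass. This is a minimal illustration under that assumption, not the paper's implementation. The function names, the mean-pooled utterance-level router, and the single linear layer per expert are all hypothetical simplifications.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mean_pool(frames):
    """Average encoder frames over time to get one utterance-level vector."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]

def merge_experts(experts, probs):
    """Probability-weighted average of expert weight matrices (same shape)."""
    rows, cols = len(experts[0]), len(experts[0][0])
    return [
        [sum(p * E[r][c] for p, E in zip(probs, experts)) for c in range(cols)]
        for r in range(rows)
    ]

def soft_merged_projector(frames, router_W, experts):
    """Project encoder frames with one merged expert (assumed design).

    frames:   list of encoder output vectors, one per time step
    router_W: router weights, one row per expert
    experts:  list of expert weight matrices (hypothetical linear experts)
    """
    probs = softmax(matvec(router_W, mean_pool(frames)))
    # Every expert has a nonzero weight in the merge, so in training
    # every expert would receive a gradient (no hard top-k selection).
    merged = merge_experts(experts, probs)
    return [matvec(merged, f) for f in frames], probs

# Toy usage: two 2-dimensional frames, two experts, a zero router.
frames = [[1.0, 2.0], [3.0, 4.0]]
router_W = [[0.0, 0.0], [0.0, 0.0]]
experts = [[[1.0, 0.0], [0.0, 1.0]], [[3.0, 0.0], [0.0, 3.0]]]
projected, probs = soft_merged_projector(frames, router_W, experts)
print(probs)
print(projected)
```

With hard top-k routing, only the selected experts would be updated on a given utterance, which is one way expert collapse can arise. Merging in parameter space avoids that and still runs a single projector per input, which is consistent with the comparable runtime the abstract reports.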