On-the-fly Routing for Zero-shot MoE Speaker Adaptation of Speech Foundation Models for Dysarthric Speech Recognition

📅 2025-05-28

🤖 AI Summary

To address zero-shot adaptation and real-time speaker customization in dysarthric speech recognition, this paper proposes a Mixture-of-Experts (MoE) speaker adaptation framework for speech foundation models. Adapter experts conditioned on speech impairment severity and gender are combined dynamically, using speaker-dependent routing parameters predicted on the fly, which brings domain knowledge into the adaptation. A KL-divergence term further enforces diversity among the experts and their generalization to unseen speakers. On the UASpeech corpus, the method yields statistically significant WER reductions of up to 1.34% absolute (6.36% relative) over the unadapted HuBERT/WavLM baselines. Against batch-mode adaptation, it gives WER reductions of up to 2.55% absolute (11.44% relative) and real-time factor (RTF) speedups of up to 7 times. It obtains the lowest published overall WER of 16.35%, and 46.77% on the very low intelligibility subset.
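The absolute and relative WER figures above are related by simple arithmetic, sketched below. The helper names are hypothetical, and the implied baseline values are derived here under the assumption that each absolute/relative pair describes the same comparison; they are not numbers reported in the source.

```python
def relative_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Relative WER reduction in percent, given two WERs in percent."""
    return 100.0 * (baseline_wer - adapted_wer) / baseline_wer


def implied_baseline(absolute_reduction: float, relative_reduction_pct: float) -> float:
    """Baseline WER (percent) implied by an absolute/relative reduction pair.

    Assumes both figures refer to the same baseline-vs-adapted comparison.
    """
    return 100.0 * absolute_reduction / relative_reduction_pct


# Pairs quoted in the summary: (absolute %, relative %).
unadapted_baseline = implied_baseline(1.34, 6.36)    # roughly 21.07% WER
batch_mode_baseline = implied_baseline(2.55, 11.44)  # roughly 22.29% WER
```

Read this way, the "up to" reductions are measured against baselines of roughly 21% to 22% WER, so they describe specific comparisons rather than the gap to the 16.35% best overall result.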

📝 Abstract

This paper proposes a novel MoE-based speaker adaptation framework for foundation models based dysarthric speech recognition. This approach enables zero-shot adaptation and real-time processing while incorporating domain knowledge. Speech impairment severity and gender conditioned adapter experts are dynamically combined using on-the-fly predicted speaker-dependent routing parameters. KL-divergence is used to further enforce diversity among experts and their generalization to unseen speakers. Experimental results on the UASpeech corpus suggest that on-the-fly MoE-based adaptation produces statistically significant WER reductions of up to 1.34% absolute (6.36% relative) over the unadapted baseline HuBERT/WavLM models. Consistent WER reductions of up to 2.55% absolute (11.44% relative) and RTF speedups of up to 7 times are obtained over batch-mode adaptation across varying speaker-level data quantities. The lowest published WER of 16.35% (46.77% on very low intelligibility) is obtained.
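The dynamic combination of adapter experts described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the softmax normalization of the routing parameters, the toy scaling "experts", and all function names. The abstract does not specify the router or adapter architecture.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def softmax(logits: Sequence[float]) -> Vector:
    """Numerically stable softmax over router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_experts(
    hidden: Vector,
    experts: Sequence[Callable[[Vector], Vector]],
    router_logits: Sequence[float],
) -> Vector:
    """Weighted sum of expert outputs for one hidden vector.

    router_logits stands in for the on-the-fly predicted,
    speaker-dependent routing parameters (hypothetical form).
    """
    weights = softmax(router_logits)
    outputs = [expert(hidden) for expert in experts]
    dim = len(outputs[0])
    return [sum(w * out[i] for w, out in zip(weights, outputs)) for i in range(dim)]


def make_scaling_expert(scale: float) -> Callable[[Vector], Vector]:
    """Toy stand-in for an adapter expert, e.g. one per severity or gender group."""
    return lambda h: [scale * x for x in h]


experts = [make_scaling_expert(s) for s in (0.5, 1.0, 2.0)]
# Equal logits give equal weights, so the result is the mean of the expert outputs.
mixed = combine_experts([1.0, -2.0], experts, [0.0, 0.0, 0.0])
```

Because the weights are predicted per input rather than estimated from a speaker's enrollment data, a scheme of this shape needs no separate adaptation pass at test time, which is consistent with the zero-shot and real-time claims.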

Problem

Research questions and friction points this paper is trying to address.

Zero-shot adaptation for dysarthric speech recognition

On-the-fly prediction of speaker-dependent routing parameters

Improving WER over unadapted baseline models, and both WER and speed over batch-mode adaptation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot MoE adaptation for dysarthric speech

Dynamic routing with severity and gender conditions

KL-divergence used to enforce expert diversity and generalization to unseen speakers
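One way a KL-divergence term could encourage expert diversity is sketched below. This is an assumption-laden illustration, not the paper's formulation: the source does not say which distributions are compared, in which direction, or how the term is weighted. Here each expert is assumed to produce a probability distribution, and the penalty is the negative mean symmetric KL divergence over expert pairs, so minimizing it pushes experts apart.

```python
import math
from itertools import combinations
from typing import List, Sequence

Distribution = Sequence[float]


def kl_divergence(p: Distribution, q: Distribution, eps: float = 1e-12) -> float:
    """KL(p || q) for discrete distributions, with eps for numerical safety."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def mean_pairwise_kl(distributions: List[Distribution]) -> float:
    """Mean symmetric KL over all expert pairs (needs at least two experts)."""
    pairs = list(combinations(distributions, 2))
    return sum(kl_divergence(p, q) + kl_divergence(q, p) for p, q in pairs) / len(pairs)


def diversity_penalty(distributions: List[Distribution]) -> float:
    """Hypothetical regularizer: lower when experts disagree more."""
    return -mean_pairwise_kl(distributions)


identical = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
diverse = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]
```

With these toy inputs, `diversity_penalty(diverse)` is lower than `diversity_penalty(identical)`, so adding the penalty to a training loss would favor experts whose outputs differ.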
