Beyond Hard Sharing: Efficient Multi-Task Speech-to-Text Modeling with Supervised Mixture of Experts

📅 2025-08-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address task interference caused by hard parameter sharing in speech-to-text multitask learning, this paper proposes a Supervised Mixture-of-Experts (S-MoE) architecture for joint modeling of automatic speech recognition (ASR) and speech translation (ST), supporting mixed-bandwidth inputs. Unlike conventional MoE approaches, S-MoE eliminates learnable gating mechanisms and instead employs task-specific guidance tokens to directly route representations to dedicated feed-forward experts, thereby decoupling task representations and eliminating parameter competition. Both encoder and decoder integrate independent expert subnetworks, enabling fine-grained task isolation and parallel optimization. Evaluated on standard multitask benchmarks, S-MoE achieves a 6.35% relative reduction in word error rate (WER), significantly improving the synergy between ASR and ST. This work introduces an efficient, interpretable, and gating-free paradigm for speech multitask modeling.

📝 Abstract
Hard-parameter sharing is a common strategy to train a single model jointly across diverse tasks. However, this often leads to task interference, impeding overall model performance. To address the issue, we propose a simple yet effective Supervised Mixture of Experts (S-MoE). Unlike traditional Mixture of Experts models, S-MoE eliminates the need for training gating functions by utilizing special guiding tokens to route each task to its designated expert. By assigning each task to a separate feedforward network, S-MoE overcomes the limitations of hard-parameter sharing. We further apply S-MoE to a speech-to-text model, enabling the model to process mixed-bandwidth input while jointly performing automatic speech recognition (ASR) and speech translation (ST). Experimental results demonstrate the effectiveness of the proposed S-MoE, achieving a 6.35% relative improvement in Word Error Rate (WER) when applied to both the encoder and decoder.
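The routing scheme described in the abstract can be sketched in a few lines: each task owns an independent feed-forward expert, and the task label carried by the guiding token selects the expert deterministically, so no gating network is ever trained. This is an illustrative sketch only; the class name, dimensions, and task labels are assumptions, not the paper's implementation.

```python
import numpy as np

class SMoELayer:
    """Gating-free Mixture-of-Experts layer (illustrative sketch).

    One feed-forward expert per task; the task id from the guiding
    token routes each input to its expert directly, replacing the
    learned gating function of a conventional MoE.
    """

    def __init__(self, d_model, d_ff, tasks, seed=0):
        rng = np.random.default_rng(seed)
        # one independent feed-forward expert (two linear maps) per task
        self.experts = {
            t: (rng.standard_normal((d_model, d_ff)) * 0.02,
                rng.standard_normal((d_ff, d_model)) * 0.02)
            for t in tasks
        }

    def __call__(self, x, task):
        # supervised routing: the task label picks the expert; no gate is learned
        w_in, w_out = self.experts[task]
        h = np.maximum(x @ w_in, 0.0)  # ReLU feed-forward expert
        return h @ w_out

layer = SMoELayer(d_model=8, d_ff=16, tasks=["asr", "st"])
x = np.ones((2, 8))          # (sequence length, d_model)
y_asr = layer(x, "asr")      # routed to the ASR expert
y_st = layer(x, "st")        # routed to the ST expert
```

Because routing is supervised by the task label rather than learned, the two tasks never compete for the same feed-forward parameters, which is the interference the paper attributes to hard-parameter sharing.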
Problem

Research questions and friction points this paper is trying to address.

Addresses task interference in multi-task speech-to-text models
Proposes Supervised Mixture of Experts (S-MoE) for task routing
Improves performance on mixed-bandwidth ASR and speech translation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Supervised Mixture of Experts (S-MoE) model
Guiding tokens for expert routing
Separate feedforward networks per task
Hojun Jin — Samsung Research, Korea
Eunsoo Hong — Samsung Research, Korea
Ziwon Hyung — Samsung Research, Korea
Sungjun Lim — Yonsei University
Seungjin Lee — Samsung Research, Korea
Keunseok Cho — Samsung Research, Korea