Losses Can Be Blessings: Routing Self-Supervised Speech Representations Towards Efficient Multilingual and Multitask Speech Processing

📅 2022-11-02

🏛️ Neural Information Processing Systems

📈 Citations: 5

✨ Influential: 0

🤖 AI Summary

Large self-supervised speech models such as wav2vec 2.0 are costly to run on resource-constrained edge devices and tend to overfit when finetuned on low-resource data, and both problems grow in multilingual and multitask settings. This paper proposes S³-Router, a framework that does not finetune the pretrained weights and instead finetunes only the model's connections, that is, which weights are kept and which are discarded. The authors report, for the first time, that discarding no more than 10% of the weights in this way yields better downstream accuracy than standard weight finetuning. S³-Router serves four purposes at once: a new finetuning scheme, an efficient multilingual/multitask solution, a state-of-the-art ASR pruning technique, and a tool for quantitatively analyzing learned speech representations. The stated goal is a win-win of improved efficiency and reduced overfitting, which makes speech SSL models more practical to deploy on-device.
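The idea of finetuning connections while leaving weights frozen can be illustrated with a toy sketch. It assumes one common way of learning a binary mask: a learnable score per weight, a top-k rule that drops the lowest-scoring fraction, and a straight-through gradient for the scores. The function names, the squared-error loss, and the single-neuron "model" are hypothetical and chosen for illustration; the text above does not specify how S³-Router learns its masks.

```python
def topk_mask(scores, sparsity):
    """Return a 0/1 mask that drops the `sparsity` fraction of lowest-scoring entries."""
    n_drop = int(len(scores) * sparsity)
    # Indices in ascending score order; the first n_drop are discarded.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    dropped = set(order[:n_drop])
    return [0 if i in dropped else 1 for i in range(len(scores))]


def masked_dot(weights, mask, x):
    """Output of a single linear neuron whose weights are gated by the mask."""
    return sum(w * m * xi for w, m, xi in zip(weights, mask, x))


def finetune_connections(weights, data, sparsity=0.1, lr=0.1, epochs=50):
    """Learn which connections to keep; `weights` is only read, never updated."""
    scores = [1.0] * len(weights)  # assumption: all connections start equally important
    for _ in range(epochs):
        for x, y in data:
            mask = topk_mask(scores, sparsity)
            err = masked_dot(weights, mask, x) - y
            # Straight-through estimator: treat d(mask)/d(score) as 1, so the
            # gradient of 0.5 * err**2 w.r.t. score i is err * w_i * x_i.
            for i in range(len(scores)):
                scores[i] -= lr * err * weights[i] * x[i]
    return topk_mask(scores, sparsity)


# Toy data: the frozen weight at index 2 hurts the target, so the learned mask drops it.
frozen = [1.0, 1.0, 5.0] + [1.0] * 7
data = [([1.0] * 10, 9.0)]
print(finetune_connections(frozen, data, sparsity=0.1))
```

In this toy case only one of the ten connections is removed (10% sparsity), and the error falls to zero without any weight being changed, which mirrors the claim at a very small scale.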

📝 Abstract

Self-supervised learning (SSL) for rich speech representations has achieved empirical success in low-resource Automatic Speech Recognition (ASR) and other speech processing tasks. It reduces the need for large amounts of transcribed speech and has thus driven a growing demand for on-device ASR and other speech processing. However, advanced speech SSL models have become increasingly large, which conflicts with limited on-device resources. This gap can be more severe in multilingual/multitask scenarios that require recognizing multiple languages or executing multiple speech processing tasks simultaneously. Additionally, strongly overparameterized speech SSL models tend to overfit when finetuned on low-resource speech corpora. This work aims to enhance the practical usage of speech SSL models towards a win-win in both enhanced efficiency and alleviated overfitting via our proposed S$^3$-Router framework, which for the first time discovers that simply discarding no more than 10% of model weights, by finetuning only the model connections of speech SSL models, can achieve better accuracy than standard weight finetuning on downstream speech processing tasks. More importantly, S$^3$-Router can serve as an all-in-one technique to enable (1) a new finetuning scheme, (2) an efficient multilingual/multitask solution, (3) a state-of-the-art ASR pruning technique, and (4) a new tool to quantitatively analyze the learned speech representation. We believe S$^3$-Router provides a new perspective for the practical deployment of speech SSL models. Our code is available at: https://github.com/GATECH-EIC/S3-Router.
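To see why connection-only finetuning could make an efficient multilingual/multitask solution, the sketch below assumes that all tasks share one set of frozen weights and that each task adds only a binary mask stored at one bit per weight. That storage layout, the function names, the example masks, and the round figure of 100 million weights are illustrative assumptions, not details taken from the abstract.

```python
def bytes_full_copies(n_weights, n_tasks, bits_per_weight=32):
    """Storage when every task keeps its own fully finetuned copy of the weights."""
    return n_tasks * n_weights * bits_per_weight // 8


def bytes_shared_plus_masks(n_weights, n_tasks, bits_per_weight=32):
    """Storage for one shared weight set plus a 1-bit-per-weight mask per task."""
    shared = n_weights * bits_per_weight // 8
    one_mask = (n_weights + 7) // 8  # round up to whole bytes
    return shared + n_tasks * one_mask


def route(shared_weights, mask):
    """Select a task-specific subnetwork by gating the shared weights."""
    return [w * m for w, m in zip(shared_weights, mask)]


# Hypothetical per-language masks over the same four shared weights.
shared = [0.5, -1.2, 0.8, 2.0]
masks = {"en": [1, 1, 0, 1], "fr": [1, 0, 1, 1]}
print(route(shared, masks["en"]))

# Illustrative sizes: 100 million float32 weights, 10 languages or tasks.
print(bytes_full_copies(100_000_000, 10))        # 4000000000 bytes
print(bytes_shared_plus_masks(100_000_000, 10))  # 525000000 bytes
```

Under these assumptions, ten tasks need roughly 0.5 GB instead of 4 GB, because each additional task costs one bit per weight instead of a full 32-bit copy.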

Problem

Research questions and friction points this paper is trying to address.

Memory Efficient Computing

Multilingual Speech Processing

Data Bias Mitigation

Innovation

Methods, ideas, or system contributions that make the work stand out.

S$^3$-Router

Multilingual Multitask Optimization

Novel Finetuning Method

🔎 Similar Papers

MS-HuBERT: Mitigating Pre-training and Inference Mismatch in Masked Language Modelling methods for learning Speech Representations