Losses Can Be Blessings: Routing Self-Supervised Speech Representations Towards Efficient Multilingual and Multitask Speech Processing

📅 2022-11-02
🏛️ Neural Information Processing Systems
📈 Citations: 5
Influential: 0
🤖 AI Summary
To address the inefficiency and overfitting of large-scale self-supervised speech models (e.g., wav2vec 2.0) under edge-device resource constraints, particularly in multilingual and multitask scenarios, this paper proposes S³-Router, a sparse routing framework that forgoes conventional weight fine-tuning and instead learns only a binary mask over model connections. The authors show, for the first time, that discarding no more than 10% of model weights by tuning connections alone can outperform standard weight fine-tuning on downstream speech processing tasks. S³-Router serves as an all-in-one technique: a new fine-tuning scheme, an efficient multilingual/multitask solution, a state-of-the-art ASR pruning method, and a quantitative tool for analyzing learned speech representations. On low-resource ASR it achieves significant accuracy gains while substantially reducing inference FLOPs and memory footprint, making it deployment-friendly on edge devices and robust across diverse domains and languages.
📝 Abstract
Self-supervised learning (SSL) for rich speech representations has achieved empirical success in low-resource Automatic Speech Recognition (ASR) and other speech processing tasks, which can mitigate the necessity of a large amount of transcribed speech and thus has driven a growing demand for on-device ASR and other speech processing. However, advanced speech SSL models have become increasingly large, which contradicts the limited on-device resources. This gap could be more severe in multilingual/multitask scenarios requiring simultaneously recognizing multiple languages or executing multiple speech processing tasks. Additionally, strongly overparameterized speech SSL models tend to suffer from overfitting when being finetuned on low-resource speech corpus. This work aims to enhance the practical usage of speech SSL models towards a win-win in both enhanced efficiency and alleviated overfitting via our proposed S$^3$-Router framework, which for the first time discovers that simply discarding no more than 10% of model weights via only finetuning model connections of speech SSL models can achieve better accuracy over standard weight finetuning on downstream speech processing tasks. More importantly, S$^3$-Router can serve as an all-in-one technique to enable (1) a new finetuning scheme, (2) an efficient multilingual/multitask solution, (3) a state-of-the-art ASR pruning technique, and (4) a new tool to quantitatively analyze the learned speech representation. We believe S$^3$-Router has provided a new perspective for practical deployment of speech SSL models. Our codes are available at: https://github.com/GATECH-EIC/S3-Router.
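The core mechanism, fine-tuning a binary mask over frozen pretrained weights via a straight-through estimator, can be sketched in a few lines of PyTorch. This is a minimal illustration of the general idea under stated assumptions, not the authors' implementation; the `MaskedLinear` wrapper, the score initialization, and the `sparsity` default are all hypothetical choices made for the example (see the official code at the repository linked above for the real method).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedLinear(nn.Module):
    """Linear layer with frozen weights; only a binary mask over the
    connections is trained. A sketch in the spirit of S^3-Router,
    not the authors' implementation."""

    def __init__(self, linear: nn.Linear, sparsity: float = 0.1):
        super().__init__()
        # Pretrained weights are frozen: they receive no gradient updates.
        self.weight = nn.Parameter(linear.weight.detach().clone(),
                                   requires_grad=False)
        self.bias = (nn.Parameter(linear.bias.detach().clone(),
                                  requires_grad=False)
                     if linear.bias is not None else None)
        # Real-valued scores are the only trainable parameters.
        self.scores = nn.Parameter(0.01 * torch.randn_like(self.weight))
        self.sparsity = sparsity  # fraction of connections to discard

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Discard the lowest-scoring `sparsity` fraction of connections.
        k = max(1, int(self.sparsity * self.scores.numel()))
        threshold = torch.kthvalue(self.scores.flatten(), k).values
        hard_mask = (self.scores > threshold).float()
        # Straight-through estimator: binary mask in the forward pass,
        # identity gradient to the real-valued scores in the backward pass.
        mask = hard_mask + self.scores - self.scores.detach()
        return F.linear(x, self.weight * mask, self.bias)

# Usage sketch: wrap a layer of a pretrained encoder and optimize
# only the mask scores; the weights themselves stay untouched.
layer = MaskedLinear(nn.Linear(768, 768), sparsity=0.1)
optimizer = torch.optim.Adam([layer.scores], lr=1e-3)
```

In this formulation the mask, rather than the weights, carries all task- and language-specific information, which is what makes a single frozen backbone shareable across languages and tasks with only a lightweight per-task binary mask.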
Problem

Research questions and friction points this paper is trying to address.

Memory-Efficient Computing
Multilingual Speech Processing
Data Bias Mitigation
Innovation

Methods, ideas, or system contributions that make the work stand out.

S$^3$-Router
Multilingual/multitask optimization
Novel fine-tuning method
Yonggan Fu
NVIDIA Research
Efficient AI · Efficient Language Models · Model Compression
Yang Zhang
MIT-IBM Watson AI Lab
Kaizhi Qian
MIT-IBM Watson AI Lab
speech processing · deep learning
Zhifan Ye
Rice University
Zhongzhi Yu
NVIDIA Research
Cheng-I Lai
MIT CSAIL
Yingyan Lin
Georgia Institute of Technology