Whistle: Data-Efficient Multilingual and Crosslingual Speech Recognition via Weakly Phonetic Supervision

📅 2024-06-04
🏛️ IEEE Transactions on Audio, Speech, and Language Processing
🤖 AI Summary
To address the high cost of manual phonetic annotation and the low data efficiency of multilingual and crosslingual automatic speech recognition (MCL-ASR), this paper proposes Whistle, a framework that uses LanguageNet grapheme-to-phoneme (G2P) models to automatically generate International Phonetic Alphabet (IPA) transcriptions as weak supervision, removing the need for human verification. It is the first work to systematically validate the efficacy of weakly phonetic supervision for MCL-ASR. Whistle pretrains a multilingual shared encoder on IPA phoneme targets and conducts a unified comparison of the phonemic, subword, and self-supervised paradigms on the CV-Lang10 benchmark. Experiments show an average 12.3% WER reduction across the 10 seen languages; for 2 unseen languages, just one hour of data suffices to significantly outperform subword and self-supervised baselines. Whistle also mitigates catastrophic forgetting and accelerates training convergence by 40%. All code, models, and data are publicly released.

📝 Abstract
There exist three approaches for multilingual and crosslingual automatic speech recognition (MCL-ASR): supervised pretraining with phonetic transcription, supervised pretraining with graphemic transcription, and self-supervised pretraining. We find that pretraining with phonetic supervision has been underappreciated so far for MCL-ASR, while conceptually it is more advantageous for information sharing between different languages. This paper explores pretraining with weakly phonetic supervision towards data-efficient MCL-ASR, an approach we call Whistle. We relax the requirement of gold-standard, human-validated phonetic transcripts and obtain International Phonetic Alphabet (IPA) based transcriptions by leveraging the LanguageNet grapheme-to-phoneme (G2P) models. We construct a common experimental setup based on the CommonVoice dataset, called CV-Lang10, with 10 seen languages and 2 unseen languages. A set of experiments is conducted on CV-Lang10 to compare, as fairly as possible, the three approaches under the common setup for MCL-ASR. Experiments demonstrate the advantages of phoneme-based models (Whistle) for MCL-ASR in terms of speech recognition for seen languages, crosslingual performance for unseen languages with different amounts of few-shot data, overcoming catastrophic forgetting, and training efficiency. When training data is more limited, phoneme supervision achieves better results than subword supervision and self-supervision, thereby providing higher data efficiency. To support reproducibility and promote future research in this direction, we release the code, models, and data for the entire pipeline of Whistle at https://github.com/thu-spmi/CAT/tree/master/egs/cv-lang10.
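The core idea of weak phonetic supervision is that graphemic transcripts are run through a G2P model to produce IPA phoneme sequences that serve as training targets without human validation. A minimal, purely illustrative sketch of this step is below; the tiny rule table and greedy matcher are assumptions for demonstration only (Whistle uses the LanguageNet Phonetisaurus G2P models, not a hand-written dictionary):

```python
# Toy sketch of weak phonetic supervision: convert graphemic transcripts to
# IPA phoneme sequences via a G2P mapping, with no human verification.
# TOY_G2P is illustrative only; real systems use trained G2P models.

TOY_G2P = {
    "en": {"sh": "ʃ", "th": "θ", "a": "æ", "e": "ɛ", "i": "ɪ", "o": "ɒ",
           "u": "ʌ", "b": "b", "d": "d", "f": "f", "h": "h", "k": "k",
           "l": "l", "m": "m", "n": "n", "p": "p", "r": "ɹ", "s": "s",
           "t": "t", "v": "v", "w": "w", "z": "z"},
    "es": {"ll": "ʎ", "ñ": "ɲ", "a": "a", "e": "e", "i": "i", "o": "o",
           "u": "u", "b": "b", "d": "d", "l": "l", "m": "m", "n": "n",
           "p": "p", "r": "ɾ", "s": "s", "t": "t"},
}

def g2p(word, lang):
    """Greedy longest-match grapheme-to-phoneme conversion (toy)."""
    table = TOY_G2P[lang]
    phones, i = [], 0
    while i < len(word):
        for span in (2, 1):  # try digraphs before single letters
            chunk = word[i:i + span]
            if chunk in table:
                phones.append(table[chunk])
                i += span
                break
        else:
            i += 1  # drop unmapped graphemes: weak labels tolerate some noise
    return phones
```

For example, `g2p("ship", "en")` yields `["ʃ", "ɪ", "p"]` and `g2p("llama", "es")` yields `["ʎ", "a", "m", "a"]`; both languages emit symbols from the same IPA space, which is what enables sharing across languages.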
Problem

Research questions and friction points this paper is trying to address.

Improving multilingual speech recognition with phonetic supervision
Exploring weakly phonetic supervision for data-efficient training
Comparing phonetic, graphemic, and self-supervised pretraining approaches
Innovation

Methods, ideas, or system contributions that make the work stand out.

Weakly phonetic supervision for MCL-ASR
IPA transcription via G2P models
Data-efficient phoneme-based training
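The sharing benefit of phoneme-based training can be sketched as follows: pooling weak IPA transcripts from several languages yields one shared phoneme vocabulary, so a phoneme occurring in multiple languages maps to the same model output unit. The transcripts below are illustrative assumptions, not data from CV-Lang10:

```python
# Toy sketch of a shared multilingual phoneme inventory: IPA sequences from
# different languages are pooled into one unit set, so common phonemes
# (e.g. /p/, /n/) are modeled by the same output unit across languages.

def build_shared_inventory(corpora):
    """corpora: {lang: [IPA phoneme sequences]} -> sorted list of units."""
    units = set()
    for seqs in corpora.values():
        for seq in seqs:
            units.update(seq)
    return sorted(units)

# Illustrative weak IPA labels (not real CV-Lang10 data).
corpora = {
    "en": [["ʃ", "ɪ", "p"], ["θ", "ɪ", "n"]],
    "es": [["ʎ", "a", "m", "a"], ["p", "a", "n"]],
}
inventory = build_shared_inventory(corpora)
unit_to_id = {u: i for i, u in enumerate(inventory)}  # shared output layer
```

Here the joint inventory has 8 units rather than the 10 a disjoint per-language vocabulary would need, since /p/ and /n/ are shared; with graphemic or subword units, such cross-language overlap is far weaker.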
👥 Authors
Saierdaer Yusuyin
School of Computer Science and Technology, Xinjiang University, Urumqi 830046, China
Te Ma
School of Computer Science and Technology, Xinjiang University, Urumqi 830046, China
Hao Huang
School of Computer Science and Technology, Xinjiang University, Urumqi 830046, China
Wenbo Zhao
China Unicom (Guangdong) Industrial Internet Co., Ltd, Guangzhou 510555, China
Zhijian Ou
Speech Processing and Machine Intelligence (SPMI) Lab, Department of Electronic Engineering, Tsinghua University, Beijing 100084, China