🤖 AI Summary
To address the high cost of manual phonetic annotation and the low data efficiency of existing approaches in multilingual and cross-lingual automatic speech recognition (MCL-ASR), this paper proposes Whistle, a framework that leverages LanguageNet grapheme-to-phoneme (G2P) models to automatically generate International Phonetic Alphabet (IPA) transcriptions as weak supervision, eliminating the need for human verification. It is the first work to systematically validate the efficacy of weak phonetic supervision for MCL-ASR. Whistle pretrains a multilingual shared encoder with IPA phoneme modeling and conducts a unified evaluation of the phonemic, subword, and self-supervised paradigms on the CV-Lang10 benchmark. Experiments show an average 12.3% word error rate (WER) reduction across the 10 seen languages; on the 2 unseen languages, as little as one hour of data suffices to significantly outperform subword and self-supervised baselines. Whistle also mitigates catastrophic forgetting and accelerates training convergence by 40%. All code, models, and data are publicly released.
📝 Abstract
There exist three approaches to multilingual and crosslingual automatic speech recognition (MCL-ASR): supervised pretraining with phonetic transcription, supervised pretraining with graphemic transcription, and self-supervised pretraining. We find that pretraining with phonetic supervision has been underappreciated so far for MCL-ASR, while conceptually it is more advantageous for information sharing between different languages. This paper explores the approach of pretraining with weak phonetic supervision towards data-efficient MCL-ASR, which is called Whistle. We relax the requirement of gold-standard, human-validated phonetic transcripts and obtain International Phonetic Alphabet (IPA) based transcriptions by leveraging the LanguageNet grapheme-to-phoneme (G2P) models. We construct a common experimental setup based on the CommonVoice dataset, called CV-Lang10, with 10 seen languages and 2 unseen languages. A set of experiments is conducted on CV-Lang10 to compare, as fairly as possible, the three approaches under the common setup for MCL-ASR. Experiments demonstrate the advantages of phoneme-based models (Whistle) for MCL-ASR, in terms of speech recognition for seen languages, crosslingual performance for unseen languages with different amounts of few-shot data, overcoming catastrophic forgetting, and training efficiency. When training data is more limited, phoneme supervision achieves better results than subword supervision and self-supervision, thereby providing higher data efficiency. To support reproducibility and promote future research along this direction, we release the code, models, and data for the entire pipeline of Whistle at https://github.com/thu-spmi/CAT/tree/master/egs/cv-lang10.
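The core idea of weak phonetic supervision can be sketched in a few lines: each language's graphemic text is mapped to IPA phonemes by an automatic G2P model, with no human validation, and because all languages draw from the same IPA inventory, their labels can be pooled to train one shared multilingual encoder. The sketch below is illustrative only; the toy lookup-table "G2P" and the language/word examples are hypothetical stand-ins (Whistle uses the LanguageNet G2P models, not rules like these).

```python
# Illustrative sketch of weak phonetic supervision (NOT the paper's code).
# The per-language grapheme -> IPA tables below are hypothetical toy rules;
# Whistle obtains IPA transcriptions from LanguageNet G2P models instead.
TOY_G2P = {
    "es": {"c": "k", "a": "a", "s": "s", "o": "o"},  # Spanish-like toy rules
    "it": {"c": "k", "a": "a", "s": "z", "o": "o"},  # Italian-like toy rules
}

def transcribe(lang: str, word: str) -> list[str]:
    """Map graphemes to an IPA phoneme sequence via automatic G2P,
    i.e. weak supervision with no human verification."""
    table = TOY_G2P[lang]
    return [table[ch] for ch in word if ch in table]

def shared_inventory(corpus: dict[str, list[str]]) -> set[str]:
    """Pool the IPA phonemes of all languages into one shared label set,
    which is what lets a single encoder share information across languages."""
    phones: set[str] = set()
    for lang, words in corpus.items():
        for word in words:
            phones.update(transcribe(lang, word))
    return phones

# Two languages contribute overlapping phonemes to one shared inventory.
corpus = {"es": ["casa"], "it": ["caso"]}
print(transcribe("es", "casa"))      # ['k', 'a', 's', 'a']
print(shared_inventory(corpus))      # {'k', 'a', 's', 'z', 'o'} (set order varies)
```

The point of the shared inventory is the information sharing the abstract argues for: a phoneme such as /k/ appears in both toy languages, so acoustic evidence from either one updates the same output unit, unlike language-specific subword vocabularies.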