🤖 AI Summary
To address the high cost of manual phonetic annotation and the low data efficiency of existing approaches in multilingual and cross-lingual automatic speech recognition (MCL-ASR), this paper proposes Whistle, a framework that leverages LanguageNet grapheme-to-phoneme (G2P) models to automatically generate International Phonetic Alphabet (IPA) transcriptions as weak supervision, eliminating the need for human verification. It is the first work to systematically validate the efficacy of weak phonetic supervision for MCL-ASR. Whistle pretrains a multilingual shared encoder with IPA phoneme modeling and conducts a unified evaluation of the phonemic, subword, and self-supervised paradigms on the CV-Lang10 benchmark. Experiments show an average 12.3% word error rate (WER) reduction across the 10 seen languages; on the 2 unseen languages, as little as one hour of data suffices to significantly outperform subword and self-supervised baselines. Whistle also mitigates catastrophic forgetting and accelerates training convergence by 40%. All code, models, and data are publicly released.
📝 Abstract
There exist three approaches to multilingual and crosslingual automatic speech recognition (MCL-ASR): supervised pretraining with phonetic transcription, supervised pretraining with graphemic transcription, and self-supervised pretraining. We find that pretraining with phonetic supervision has been underappreciated so far for MCL-ASR, while conceptually it is more advantageous for information sharing between different languages. This paper explores the approach of pretraining with weak phonetic supervision towards data-efficient MCL-ASR, which is called Whistle. We relax the requirement of gold-standard, human-validated phonetic transcripts and obtain International Phonetic Alphabet (IPA) based transcriptions by leveraging the LanguageNet grapheme-to-phoneme (G2P) models. We construct a common experimental setup based on the CommonVoice dataset, called CV-Lang10, with 10 seen languages and 2 unseen languages. A set of experiments is conducted on CV-Lang10 to compare, as fairly as possible, the three approaches under the common setup for MCL-ASR. Experiments demonstrate the advantages of phoneme-based models (Whistle) for MCL-ASR, in terms of speech recognition for seen languages, crosslingual performance for unseen languages with different amounts of few-shot data, overcoming catastrophic forgetting, and training efficiency. When training data is more limited, phoneme supervision achieves better results than subword supervision and self-supervision, thereby providing higher data efficiency. To support reproducibility and promote future research along this direction, we release the code, models, and data for the entire pipeline of Whistle at https://github.com/thu-spmi/CAT/tree/master/egs/cv-lang10.
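The core idea of weak phonetic supervision can be sketched in a few lines: each language's graphemic text is mapped to IPA phonemes by an automatic G2P model, with no human validation, and because all languages draw from the same IPA inventory, their labels can be pooled to train one shared multilingual encoder. The sketch below is illustrative only; the toy lookup-table "G2P" and the language/word examples are hypothetical stand-ins (Whistle uses the LanguageNet G2P models, not rules like these).

```python
# Illustrative sketch of weak phonetic supervision (NOT the paper's code).
# The per-language grapheme -> IPA tables below are hypothetical toy rules;
# Whistle obtains IPA transcriptions from LanguageNet G2P models instead.
TOY_G2P = {
    "es": {"c": "k", "a": "a", "s": "s", "o": "o"},  # Spanish-like toy rules
    "it": {"c": "k", "a": "a", "s": "z", "o": "o"},  # Italian-like toy rules
}

def transcribe(lang: str, word: str) -> list[str]:
    """Map graphemes to an IPA phoneme sequence via automatic G2P,
    i.e. weak supervision with no human verification."""
    table = TOY_G2P[lang]
    return [table[ch] for ch in word if ch in table]

def shared_inventory(corpus: dict[str, list[str]]) -> set[str]:
    """Pool the IPA phonemes of all languages into one shared label set,
    which is what lets a single encoder share information across languages."""
    phones: set[str] = set()
    for lang, words in corpus.items():
        for word in words:
            phones.update(transcribe(lang, word))
    return phones

# Two languages contribute overlapping phonemes to one shared inventory.
corpus = {"es": ["casa"], "it": ["caso"]}
print(transcribe("es", "casa"))      # ['k', 'a', 's', 'a']
print(shared_inventory(corpus))      # {'k', 'a', 's', 'z', 'o'} (set order varies)
```

The point of the shared inventory is the information sharing the abstract argues for: a phoneme such as /k/ appears in both toy languages, so acoustic evidence from either one updates the same output unit, unlike language-specific subword vocabularies.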