GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement

📅 2024-06-17
🏛️ arXiv.org
📈 Citations: 8
Influential: 0
🤖 AI Summary
To address the performance limitations of automatic speech recognition (ASR) for low-resource languages caused by scarce labeled data, this paper proposes an end-to-end automated multilingual corpus construction framework. Leveraging unlabeled YouTube videos, it integrates Whisper-based initial transcription, MMS-based forced alignment, and multi-dimensional quality filtering to generate high-fidelity pseudo-labels. It further introduces an enhanced Noisy Student self-training paradigm to enable iterative model refinement. The authors report this is the first work to significantly outperform Whisper large-v3 and leading commercial ASR systems using a lightweight model with only 10% of their parameter count. On real-world YouTube test sets for Thai, Indonesian, and Vietnamese, the approach achieves average WER reductions of 25–40%. The framework establishes an efficient, scalable paradigm for both data curation and modeling in low-resource ASR.

📝 Abstract
The evolution of speech technology has been spurred by the rapid increase in dataset sizes. Traditional speech models generally depend on a large amount of labeled training data, which is scarce for low-resource languages. This paper presents GigaSpeech 2, a large-scale, multi-domain, multilingual speech recognition corpus. It is designed for low-resource languages and does not rely on paired speech and text data. GigaSpeech 2 comprises about 30,000 hours of automatically transcribed speech, including Thai, Indonesian, and Vietnamese, gathered from unlabeled YouTube videos. We also introduce an automated pipeline for data crawling, transcription, and label refinement. Specifically, this pipeline involves Whisper for initial transcription, MMS for forced alignment, and multi-dimensional filtering for data quality assurance. A modified Noisy Student Training is developed to further refine flawed pseudo labels iteratively, thereby enhancing model performance. Experimental results on our manually transcribed evaluation set and two public test sets from Common Voice and FLEURS confirm our corpus's high quality and broad applicability. Notably, ASR models trained on GigaSpeech 2 can reduce the word error rate for Thai, Indonesian, and Vietnamese on our challenging and realistic YouTube test set by 25% to 40% compared to Whisper large-v3, with merely 10% of the model parameters. Furthermore, our ASR models trained on GigaSpeech 2 yield superior performance compared to commercial services. We hope that our newly introduced corpus and pipeline will open a new avenue for low-resource speech recognition and significantly facilitate research in this area.
Problem

Research questions and friction points this paper is trying to address.

Addresses low-resource language ASR with automated data pipeline
Reduces word error rates for Thai, Indonesian, and Vietnamese
Enhances model performance with iterative pseudo-label refinement
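The headline metric in these claims, word error rate (WER), is word-level edit distance divided by reference length. A minimal self-contained sketch of the metric (not the paper's scoring code, whose tokenization choices per language may differ):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein distance over words / reference length."""
    r, h = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # deletions
    for j in range(len(h) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            substitution = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("a b c", "a x c"))  # -> 0.3333... (one substitution out of three words)
```

A 25–40% *reduction* in WER, as reported, is relative: e.g. a drop from 0.30 to 0.18 is a 40% reduction.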
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated pipeline for data crawling and transcription
Multi-dimensional filtering for quality assurance
Modified Noisy Student Training for label refinement
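The multi-dimensional filtering step can be pictured as threshold checks on per-segment decoding statistics. The sketch below uses the fields that openai-whisper's `transcribe()` emits per segment (`avg_logprob`, `no_speech_prob`, `compression_ratio`); the thresholds are illustrative assumptions, not the paper's published criteria:

```python
def keep_segment(segment,
                 min_avg_logprob=-1.0,
                 max_no_speech_prob=0.6,
                 max_compression_ratio=2.4):
    """Return True if a Whisper-style segment passes basic quality checks.

    `segment` mimics the dicts in the 'segments' list produced by
    openai-whisper's transcribe(). Threshold values are illustrative.
    """
    if not segment["text"].strip():
        return False  # empty transcription
    if segment["avg_logprob"] < min_avg_logprob:
        return False  # low decoder confidence
    if segment["no_speech_prob"] > max_no_speech_prob:
        return False  # likely non-speech audio
    if segment["compression_ratio"] > max_compression_ratio:
        return False  # highly repetitive, likely degenerate output
    return True

# Toy segments standing in for real transcribe() output:
segments = [
    {"text": "sawasdee khrap", "avg_logprob": -0.3,
     "no_speech_prob": 0.1, "compression_ratio": 1.4},
    {"text": "la la la la la la", "avg_logprob": -0.2,
     "no_speech_prob": 0.1, "compression_ratio": 3.1},
]
kept = [s for s in segments if keep_segment(s)]
print(len(kept))  # -> 1 (the repetitive segment is dropped)
```

In a full pipeline, segments surviving such filters would become pseudo-labels, and the Noisy Student loop would retrain on them, re-transcribe, and re-filter each round.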
Yifan Yang
MoE Key Lab of Artificial Intelligence, X-LANCE Lab, Shanghai Jiao Tong University
Zheshu Song
MoE Key Lab of Artificial Intelligence, X-LANCE Lab, Shanghai Jiao Tong University
Jianheng Zhuo
MoE Key Lab of Artificial Intelligence, X-LANCE Lab, Shanghai Jiao Tong University
Mingyu Cui
The Chinese University of Hong Kong
Speech Recognition, Machine Learning
Jinpeng Li
Dept EE, Tsinghua University
Bo Yang
Peng Cheng Laboratory
Yexing Du
Harbin Institute of Technology
Ziyang Ma
MoE Key Lab of Artificial Intelligence, X-LANCE Lab, Shanghai Jiao Tong University
Xunying Liu
Chinese University of Hong Kong
Speech and Language Processing, Machine Learning
Ziyuan Wang
Birch AI
Ke Li
Dataocean AI
Shuai Fan
AISpeech Ltd
Kai Yu
MoE Key Lab of Artificial Intelligence, X-LANCE Lab, Shanghai Jiao Tong University, AISpeech Ltd
Wei-Qiang Zhang
Dept EE, Tsinghua University, SpeechColab
Guoguo Chen
Seasalt AI Inc, SpeechColab
Xie Chen
MoE Key Lab of Artificial Intelligence, X-LANCE Lab, Shanghai Jiao Tong University, SpeechColab