🤖 AI Summary
To address the low-resource challenge in classroom speech recognition—where abundant weakly labeled data coexists with scarce high-accuracy annotations—this paper proposes a Weakly Supervised Pretraining (WSP) paradigm. WSP first performs noise-robust pretraining of end-to-end ASR models (Conformer and Whisper) on 5,000 hours of inexpensive weak-label data, followed by fine-tuning on a small set of gold-standard transcriptions. To mitigate label noise, WSP integrates label smoothing with curriculum learning. Experiments demonstrate that WSP consistently outperforms state-of-the-art semi-supervised and self-supervised methods under both realistic and synthetic weak-label settings, achieving an 18–24% relative reduction in word error rate. The approach delivers an efficient, production-ready weakly supervised solution tailored for educational ASR applications.
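The two noise-mitigation ingredients the summary mentions can be illustrated with a minimal, self-contained sketch. The function names, the uniform smoothing distribution, and the per-utterance noise score are illustrative assumptions, not the paper's actual implementation: label smoothing redistributes a small probability mass `eps` away from the (possibly wrong) weak label, and curriculum learning orders pretraining data from cleanest to noisiest.

```python
import math

def smoothed_cross_entropy(log_probs, target, eps=0.1):
    """Label-smoothed cross-entropy for a single token.

    log_probs: log-probabilities over the vocabulary (one value per class).
    target: index of the (possibly noisy) weak label.
    eps: probability mass spread uniformly over all classes, so a wrong
         weak label incurs a bounded penalty instead of driving the model
         to full confidence in it.
    Illustrative sketch only; the paper's exact loss may differ.
    """
    vocab = len(log_probs)
    loss = 0.0
    for i, lp in enumerate(log_probs):
        # Smoothed target: (1 - eps) on the label, eps/V uniform elsewhere.
        q = (1.0 - eps) * (1.0 if i == target else 0.0) + eps / vocab
        loss -= q * lp
    return loss

def curriculum_order(utterances, noise_score):
    """Order pretraining utterances easy-to-hard by estimated label noise.

    noise_score is a hypothetical callable returning an estimate of how
    noisy an utterance's weak transcript is (lower = cleaner = earlier).
    """
    return sorted(utterances, key=noise_score)
```

With `eps=0` the loss reduces to ordinary cross-entropy; increasing `eps` raises the loss whenever the model concentrates all its probability on the weak label, which is the desired regularizing effect under noisy supervision.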
📝 Abstract
Recent progress in speech recognition has relied on models trained on vast amounts of labeled data. However, classroom Automatic Speech Recognition (ASR) faces the real-world challenge of abundant weak transcripts paired with only a small amount of accurate, gold-standard data. In such low-resource settings, high transcription costs make re-transcription impractical. To address this, we ask: what is the best approach when abundant inexpensive weak transcripts coexist with limited gold-standard data, as is the case for classroom speech data? We propose Weakly Supervised Pretraining (WSP), a two-step process where models are first pretrained on weak transcripts in a supervised manner, and then fine-tuned on accurate data. Our results, based on both synthetic and real weak transcripts, show that WSP outperforms alternative methods, establishing it as an effective training methodology for low-resource ASR in real-world scenarios.