🤖 AI Summary
This work addresses the poor performance of automatic speech recognition (ASR) on low-resource languages, which stems from a scarcity of publicly available aligned speech-text data. The authors present WorldSpeech, a 24 kHz multilingual corpus of aligned speech and transcripts assembled from diverse open sources, including parliamentary proceedings, international broadcasts, and public-domain audiobooks. The corpus spans 76 languages and totals 65,000 hours. Of these languages, 37 have more than 200 hours of aligned data, 28 exceed 500 hours, and 24 surpass 1,000 hours, which substantially narrows existing resource gaps. Fine-tuning existing ASR models on this corpus yields an average relative word error rate (WER) reduction of 63.5% across 11 typologically diverse languages, underscoring the value of large-scale aligned data for low-resource ASR.
📝 Abstract
Automatic speech recognition (ASR) performs well for high-resource languages with abundant paired audio-transcript data, but its accuracy degrades sharply for most languages because little aligned data is publicly available. To address this gap, we introduce WorldSpeech, a 24 kHz multilingual speech corpus comprising 65k hours of aligned audio-transcript data across 76 languages, collected from diverse public sources including parliamentary proceedings, international broadcasts, and public-domain audiobooks. For 37 languages, WorldSpeech provides more than 200 hours of aligned speech, with 28 exceeding 500 hours and 24 surpassing 1k hours. Fine-tuning existing ASR models on WorldSpeech yields an average relative word error rate (WER) reduction of 63.5% across 11 typologically diverse languages.
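The headline figure is a relative reduction, so it may help to see how such a number is usually computed. The sketch below is a minimal illustration, not the authors' evaluation code. It assumes WER is the word-level edit distance divided by the reference length, and that the average is an unweighted mean of per-language relative reductions. The abstract does not specify the averaging scheme or any text normalization, and the function names and example values are hypothetical.

```python
from typing import Sequence, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, finetuned_wer: float) -> float:
    """Fraction of the baseline WER removed by fine-tuning."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - finetuned_wer) / baseline_wer


def average_relative_reduction(pairs: Sequence[Tuple[float, float]]) -> float:
    """Unweighted mean over languages (an assumed averaging scheme)."""
    if not pairs:
        raise ValueError("at least one (baseline, finetuned) pair is required")
    return sum(relative_wer_reduction(b, f) for b, f in pairs) / len(pairs)


if __name__ == "__main__":
    # Made-up (baseline WER, fine-tuned WER) pairs for two languages;
    # these are illustrative values, not results from the paper.
    example_pairs = [(0.40, 0.10), (0.50, 0.25)]
    print(f"{average_relative_reduction(example_pairs):.1%}")
```

Under this definition, a language whose WER falls from 40% to 10% contributes a 75% relative reduction even though the absolute drop is 30 points, which is why relative figures can look large when baseline error rates are high.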