WorldSpeech: A Multilingual Speech Corpus from Around the World

📅 2026-05-09

🤖 AI Summary
This work addresses the poor accuracy of automatic speech recognition (ASR) for low-resource languages, which stems from a scarcity of publicly available aligned speech-text data. The authors assemble WorldSpeech, a large-scale multilingual corpus of aligned speech drawn from diverse open sources, including parliamentary proceedings, international broadcasts, and public-domain audiobooks, spanning 76 languages and totaling 65,000 hours. Of these, 37 languages have more than 200 hours of aligned speech, 28 exceed 500 hours, and 24 exceed 1,000 hours, which substantially narrows existing resource gaps. Fine-tuning existing ASR models on the corpus yields an average relative word error rate reduction of 63.5% across 11 typologically diverse languages, suggesting that both corpus scale and quality matter for low-resource ASR.
📝 Abstract
Automatic speech recognition (ASR) performs well for high-resource languages with abundant paired audio-transcript data, but its accuracy degrades sharply for most languages because little aligned data is publicly available. To address this gap, we introduce WorldSpeech, a 24 kHz multilingual speech corpus comprising 65k hours of aligned audio-transcript data across 76 languages, collected from diverse public sources including parliamentary proceedings, international broadcasts, and public-domain audiobooks. For 37 languages, WorldSpeech provides more than 200 hours of aligned speech, with 28 exceeding 500 hours and 24 surpassing 1k hours. Fine-tuning existing ASR models on WorldSpeech results in an average relative word error rate (WER) reduction of 63.5% across 11 typologically diverse languages.
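For readers unfamiliar with the headline metric, the following is a minimal sketch of how a word error rate and a relative WER reduction are commonly computed. It is not the authors' evaluation code: the function names are hypothetical, the numbers are illustrative rather than taken from the paper, and averaging per-language relative reductions with equal weight is an assumption, since the abstract does not say how the 63.5% average was formed.

```python
from typing import Iterable, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, finetuned_wer: float) -> float:
    """Fraction of the baseline WER removed by fine-tuning."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - finetuned_wer) / baseline_wer


def mean_relative_reduction(pairs: Iterable[Tuple[float, float]]) -> float:
    """Equal-weight mean of per-language relative reductions (an assumption)."""
    reductions = [relative_wer_reduction(b, f) for b, f in pairs]
    if not reductions:
        raise ValueError("need at least one (baseline, finetuned) pair")
    return sum(reductions) / len(reductions)


if __name__ == "__main__":
    # Illustrative values only; these are not results from the paper.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    print(mean_relative_reduction([(0.40, 0.10), (0.20, 0.10)]))
```

Under this definition, a language whose WER falls from 40% to 10% has a relative reduction of 75%, even though the absolute drop is 30 percentage points. This is why relative reductions can look large for languages that start from a weak baseline.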
Problem

Research questions and friction points this paper is trying to address.

automatic speech recognition
low-resource languages
aligned audio-transcript data
multilingual speech corpus
Innovation

Methods, ideas, or system contributions that make the work stand out.

multilingual speech corpus
automatic speech recognition
low-resource languages
aligned audio-transcript data
Word-Error-Rate reduction