🤖 AI Summary
African languages, including Nigeria's three major languages (Igbo, Hausa, and Yoruba), have long suffered from a scarcity of high-quality speech data, leaving speech technologies underdeveloped for roughly one billion people. To address this, the authors introduce NaijaVoices, a large, culturally grounded speech-text dataset: 1,800 hours of audio from 5,000+ speakers, spanning diverse geographic regions, age groups, and accents, accompanied by quality control and annotation protocols. They describe a crowdsourcing approach designed to balance cultural representativeness, acoustic diversity, and scalability. Fine-tuning three ASR models (Whisper, MMS, and XLSR) on the dataset yields average WER reductions of 75.86%, 52.06%, and 42.33%, respectively. These results suggest that the dataset can advance multilingual speech processing for African languages.
📝 Abstract
The development of high-performing, robust, and reliable speech technologies depends on large, high-quality datasets. However, African languages, including our focus languages Igbo, Hausa, and Yoruba, remain under-represented due to insufficient data. Popular voice-enabled technologies do not support any of the 2,000+ African languages, limiting accessibility for roughly one billion people. While previous dataset efforts exist for the target languages, they lack the scale and diversity needed for robust speech models. To bridge this gap, we introduce the NaijaVoices dataset, a 1,800-hour speech-text dataset with 5,000+ speakers. We outline our unique data collection approach, analyze its acoustic diversity, and demonstrate its impact through fine-tuning experiments on automatic speech recognition, achieving average WER improvements of 75.86% (Whisper), 52.06% (MMS), and 42.33% (XLSR). These results highlight NaijaVoices' potential to advance multilingual speech processing for African languages.
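
To make the reported metric concrete, the sketch below shows how word error rate (WER) and a WER improvement are commonly computed. The abstract does not say whether the improvements are relative or absolute, so the relative form used here is an assumption. The function names `word_error_rate` and `relative_wer_reduction` are illustrative and do not come from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (word-level Levenshtein distance).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, finetuned_wer: float) -> float:
    """Percentage by which fine-tuning lowers WER relative to the baseline.

    Assumption: the reported improvements are relative reductions; the
    abstract does not state this explicitly.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - finetuned_wer) / baseline_wer
```

Under this assumption, a hypothetical model whose WER falls from 0.80 to 0.20 after fine-tuning shows a 75% relative reduction. These numbers are made up for illustration and are not results from the paper. In practice, transcripts are usually normalized (casing, punctuation, diacritics) before scoring, and that choice can change the WER noticeably for tonal languages such as Yoruba and Igbo.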