🤖 AI Summary
This work addresses the persistent underrepresentation of low-resource languages in speech technology, which has created a significant digital divide for speakers of most sub-Saharan African languages. To bridge this gap, the authors present WAXAL, a large-scale, openly accessible speech corpus covering 21 African languages that together represent over 100 million speakers. The dataset comprises approximately 1,250 hours of transcribed natural speech for automatic speech recognition (ASR) and over 180 hours of high-quality, single-speaker audio for text-to-speech (TTS) synthesis. Developed in partnership with four African academic and community organizations, WAXAL employs phonetically balanced script design, field-based data collection, multi-tiered quality control, and rigorous ethical protocols. Released under a CC-BY-4.0 license on Hugging Face, the corpus provides critical infrastructure for inclusive AI research and for building equitable speech technologies for underserved linguistic communities.
📝 Abstract
The advancement of speech technology has predominantly favored high-resource languages, creating a significant digital divide for speakers of most sub-Saharan African languages. To address this gap, we introduce WAXAL, a large-scale, openly accessible speech dataset for 21 languages representing over 100 million speakers. The collection consists of two main components: an Automatic Speech Recognition (ASR) dataset containing approximately 1,250 hours of transcribed, natural speech from a diverse range of speakers, and a Text-to-Speech (TTS) dataset with over 180 hours of high-quality, single-speaker recordings of phonetically balanced scripts. This paper details our methodology for data collection, annotation, and quality control, which involved partnerships with four African academic and community organizations. We provide a detailed statistical overview of the dataset and discuss its potential limitations and ethical considerations. The WAXAL datasets are released at https://huggingface.co/datasets/google/WaxalNLP under the permissive CC-BY-4.0 license to catalyze research, enable the development of inclusive technologies, and serve as a vital resource for the digital preservation of these languages.