WAXAL: A Large-Scale Multilingual African Language Speech Corpus

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the persistent underrepresentation of low-resource languages in speech technology, which has widened the digital divide for speakers of most Sub-Saharan African languages. To bridge this gap, the authors present WAXAL, a large-scale, openly accessible speech corpus covering 21 African languages spoken by over 100 million people. The dataset comprises approximately 1,250 hours of transcribed natural speech from a diverse range of speakers for automatic speech recognition (ASR) and over 180 hours of high-quality single-speaker recordings for text-to-speech (TTS) synthesis. Developed through close collaboration with local academic and community partners, WAXAL employs phonetically balanced script design, field-based data collection, multi-tiered quality control, and rigorous ethical protocols. Released under the CC-BY-4.0 license on Hugging Face, the corpus is intended as infrastructure for inclusive AI research and for the development of equitable speech technologies for underserved linguistic communities.

📝 Abstract
The advancement of speech technology has predominantly favored high-resource languages, creating a significant digital divide for speakers of most Sub-Saharan African languages. To address this gap, we introduce WAXAL, a large-scale, openly accessible speech dataset for 21 languages representing over 100 million speakers. The collection consists of two main components: an Automatic Speech Recognition (ASR) dataset containing approximately 1,250 hours of transcribed, natural speech from a diverse range of speakers, and a Text-to-Speech (TTS) dataset with over 180 hours of high-quality, single-speaker recordings reading phonetically balanced scripts. This paper details our methodology for data collection, annotation, and quality control, which involved partnerships with four African academic and community organizations. We provide a detailed statistical overview of the dataset and discuss its potential limitations and ethical considerations. The WAXAL datasets are released at https://huggingface.co/datasets/google/WaxalNLP under the permissive CC-BY-4.0 license to catalyze research, enable the development of inclusive technologies, and serve as a vital resource for the digital preservation of these languages.
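For readers who want to try the release mentioned in the abstract, the following is a minimal sketch and not the authors' documented workflow. It assumes the Hugging Face `datasets` library is installed. The helper names `repo_id_from_url` and `load_waxal` are hypothetical, and because the abstract does not name the dataset's configurations or splits, `config_name` and `split` are placeholders to be filled in from the dataset card.

```python
from urllib.parse import urlparse

# URL as given in the abstract.
WAXAL_URL = "https://huggingface.co/datasets/google/WaxalNLP"


def repo_id_from_url(url: str) -> str:
    """Extract the 'owner/name' repo ID from a Hugging Face dataset URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Expected path layout: /datasets/<owner>/<name>
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"not a Hugging Face dataset URL: {url}")
    return f"{parts[1]}/{parts[2]}"


def load_waxal(config_name: str, split: str, streaming: bool = True):
    """Load one configuration of the corpus (requires network access).

    ASSUMPTION: config_name and split are not given in the abstract;
    look them up on the dataset card before calling this function.
    """
    from datasets import load_dataset  # imported lazily: optional dependency

    return load_dataset(
        repo_id_from_url(WAXAL_URL),
        name=config_name,
        split=split,
        streaming=streaming,  # iterate lazily instead of downloading everything first
    )
```

Streaming is switched on by default in this sketch because the corpus holds well over a thousand hours of audio. Pass `streaming=False` to download and cache the selected split locally instead.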
Problem

Research questions and friction points this paper is trying to address.

speech technology
digital divide
low-resource languages
Sub-Saharan African languages
language inclusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

multilingual speech corpus
low-resource languages
African languages
ASR dataset
TTS dataset
Authors

A. Diack, Google Research
Perry Nelson, Google Research
Kwaku Agbesi, Google Research
Angela Nakalembe, Google Research
Mohamedelfatih Mohamedkhair, Google Research
Vusumuzi Dube, Google Research
Tavonga Siyavora, Google Research
Subhashini Venugopalan, University of Texas at Austin. Interests: Natural Language Processing, Computer Vision, Machine Learning
Jason Hickey, Google Research. Interests: machine learning, weather, climate, programming languages, theorem proving
Uche Okonkwo, Google Research
Abhishek Bapna, Google Research
I. Wiafe, University of Ghana
Raynard Dodzi Helegah, University of Ghana
E. D. Atsakpo, University of Ghana
Charles Nutrokpor, University of Ghana
Fiifi Baffoe Payin Winful, University of Ghana
Kafui Kwashie Solaga, University of Ghana
J. Abdulai, University of Ghana
A. Ekpezu, University of Ghana
Audace Niyonkuru, Digital Umuganda
Samuel Rutunda, Digital Umuganda
Boris Ishimwe, Digital Umuganda
Michael Melese, Addis Ababa University
Engineer Bainomugisha, Professor of Computer Science, Makerere University, Kampala. Interests: Programming Languages, Distributed systems, Reactive Programming, Cloud, AI and machine learning
Joyce Nakatumba‐Nabende, Makerere University
Andrew Katumba, Lecturer, Makerere University. Interests: Machine Learning, AI4 Social Good, AI in Health, Photonics, Neuromorphic Computing
Claire Babirye, Makerere University
Jonathan Mukiibi, Makerere University
Vincent Kimani, Loud and Clear Communications Ltd
Samuel Kibacia, Loud and Clear Communications Ltd
James Maina, Loud and Clear Communications Ltd
Fridah Emmah, Loud and Clear Communications Ltd
Ahmed Ibrahim Shekarau, Media Trust Limited
Ibrahim Shehu Adamu, Media Trust Limited
Yusuf Abdullahi, Media Trust Limited
Howard Lakougna, Gates Foundation
Bob MacDonald, Google Research
Hadar Shemtov, Google Research
Aisha Walcott-Bryant, Google Research
Moustapha Cissé, Google Research
Avinatan Hassidim, Google Research
Jeff Dean, Google Chief Scientist, Google Research and Google DeepMind. Interests: Distributed systems, Artificial Intelligence, machine learning, compilers, computer architecture
Yossi Matias, Google