Swivuriso: The South African Next Voices Multilingual Speech Dataset

📅 2025-12-01
🤖 AI Summary
High-quality, multi-domain speech data for automatic speech recognition (ASR) in seven South African languages is scarce, which hinders robust ASR development for these low-resource languages. Method: We construct a large-scale, multi-domain speech dataset covering all seven languages, totaling 3,000 hours, with content spanning agriculture, healthcare, and general-domain topics. Data collection adheres to strict ethical guidelines; transcripts undergo human verification, and audio passes through standardized preprocessing to support high fidelity and annotation accuracy. Contribution/Results: The dataset fills a gap in benchmark resources for low-resource ASR in South African languages, enabling end-to-end model training and cross-domain acoustic modeling. Experiments across mainstream ASR architectures, including Conformer, wav2vec 2.0, and Whisper, indicate that models trained or fine-tuned on this dataset outperform those built on existing public benchmarks. The dataset provides a reproducible, scalable foundation for evaluating and developing ASR systems for under-resourced languages.

📝 Abstract
This paper introduces Swivuriso, a 3,000-hour multilingual speech dataset developed as part of the African Next Voices project to support the development and benchmarking of automatic speech recognition (ASR) technologies in seven South African languages. Covering agriculture, healthcare, and general-domain topics, Swivuriso addresses significant gaps in existing ASR datasets. We describe the design principles, ethical considerations, and data collection procedures that guided the dataset's creation. We present baseline results from training and fine-tuning ASR models on this data and compare them with results on other ASR datasets for the languages concerned.
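The abstract does not name the metric behind its baseline comparison. ASR baselines are commonly compared by word error rate (WER), so the following is a minimal sketch under that assumption. The function names and the whitespace tokenization are illustrative choices, not details from the paper.

```python
from typing import Iterable, List, Tuple


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    # prev[j] holds the distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER for a single utterance, using whitespace tokenization (an assumption)."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hypothesis.split()) / len(ref)


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Corpus-level WER: total edits divided by total reference words."""
    total_edits = 0
    total_words = 0
    for reference, hypothesis in pairs:
        ref = reference.split()
        total_edits += edit_distance(ref, hypothesis.split())
        total_words += len(ref)
    if total_words == 0:
        raise ValueError("references must contain at least one word")
    return total_edits / total_words
```

In a comparison like the one the abstract describes, a score of this kind would typically be computed per language and per test set, after applying the same text normalization to references and hypotheses.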
Problem

Research questions and friction points this paper is trying to address.

Addresses gaps in multilingual ASR datasets for South African languages
Supports ASR development and benchmarking across seven languages
Covers agriculture, healthcare, and general domain topics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual speech dataset for seven South African languages
Covers agriculture, healthcare, and general domain topics
Provides baseline ASR model training and benchmarking results
Vukosi Marivate
University of Pretoria
Kayode Olaleye
Unknown affiliation
speech and language processing
Sitwala Mundia
University of Pretoria
Andinda Bakainga
University of Pretoria
Unarine Netshifhefhe
University of Pretoria
Mahmooda Milanzie
University of Pretoria
T. H. Mogale
University of Pretoria
Thapelo Sindane
University of Pretoria
Zainab Abdulrasaq
University of Pretoria
Kesego Mokgosi
Technological University Dublin
Chijioke I. Okorie
University of Pretoria
Nia Zion Van Wyk
University of Pretoria
Graham Morrissey
Way With Words, SADiLaR, Pennsylvania State University
Dale Dunbar
Way With Words, SADiLaR, Pennsylvania State University
Francois Smit
Way With Words, SADiLaR, Pennsylvania State University
Tsosheletso Chidi
University of Pretoria
Rooweither Mabuya
Lelapa AI
Andiswa Bukula
Elsewhere
Respect Mlambo
Lelapa AI
Tebogo Macucwa
University of Pretoria
Idris Abdulmumin
Postdoctoral Fellow, DSFSI, University of Pretoria
Machine Translation, Neural Machine Translation, Natural Language Processing, Internet Technology
Seani Rananga
University of Pretoria