Swivuriso: The South African Next Voices Multilingual Speech Dataset

📅 2025-12-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: High-quality, multi-domain speech data for automatic speech recognition (ASR) in seven South African languages is critically scarce, hindering robust ASR development for these low-resource languages. Method: We construct the first large-scale, multi-domain speech dataset covering all seven languages, totaling 3,000 hours, with content spanning agriculture, healthcare, and general-domain topics. Data collection adheres to strict ethical guidelines; transcripts undergo rigorous human verification, and audio is processed via standardized preprocessing to ensure high fidelity and annotation accuracy. Contribution/Results: This dataset fills a critical gap in benchmark resources for indigenous low-resource ASR in South Africa, enabling end-to-end model training and cross-domain acoustic modeling. Extensive experiments across mainstream ASR architectures (including Conformer, wav2vec 2.0, and Whisper) demonstrate that models trained or fine-tuned on our dataset consistently outperform those based on existing public benchmarks. The dataset provides a reproducible, scalable foundation for evaluating and developing ASR systems for under-resourced languages.

📝 Abstract

This paper introduces Swivuriso, a 3000-hour multilingual speech dataset developed as part of the African Next Voices project, to support the development and benchmarking of automatic speech recognition (ASR) technologies in seven South African languages. Covering agriculture, healthcare, and general domain topics, Swivuriso addresses significant gaps in existing ASR datasets. We describe the design principles, ethical considerations, and data collection procedures that guided the dataset creation. We present baseline results from training and fine-tuning ASR models on this data and compare them with results on other ASR datasets for the languages concerned.

Problem

Research questions and friction points this paper is trying to address.

Addresses gaps in multilingual ASR datasets for South African languages

Supports ASR development and benchmarking across seven languages

Covers agriculture, healthcare, and general domain topics

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual speech dataset for seven South African languages

Covers agriculture, healthcare, and general domain topics

Provides baseline ASR model training and benchmarking results
