🤖 AI Summary
High-quality, multi-domain speech data for automatic speech recognition (ASR) in South Africa’s seven official languages is critically scarce, hindering robust ASR development for these low-resource languages.
Method: We construct the first large-scale, multi-domain speech dataset covering all seven languages—totaling 3,000 hours—with content spanning agriculture, healthcare, and other domains. Data collection adheres to strict ethical guidelines; transcripts undergo rigorous human verification, and audio is processed via standardized preprocessing to ensure high fidelity and annotation accuracy.
Contribution/Results: This dataset fills a critical gap in benchmark resources for indigenous low-resource ASR in South Africa, enabling end-to-end model training and cross-domain acoustic modeling. Extensive experiments across mainstream ASR architectures—including conformer, wav2vec 2.0, and Whisper—demonstrate that models trained or fine-tuned on our dataset consistently outperform those based on existing public benchmarks. The dataset provides a reproducible, scalable foundation for evaluation and development of ASR systems for under-resourced languages.
📝 Abstract
This paper introduces Swivuriso, a 3000-hour multilingual speech dataset developed as part of the African Next Voices project, to support the development and benchmarking of automatic speech recognition (ASR) technologies in seven South African languages. Covering agriculture, healthcare, and general domain topics, Swivuriso addresses significant gaps in existing ASR datasets. We describe the design principles, ethical considerations, and data collection procedures that guided the dataset creation. We present baseline results of training/finetuning ASR models with this data and compare to other ASR datasets for the langauges concerned.