Sagalee: an Open Source Automatic Speech Recognition Dataset for Oromo Language

📅 2025-02-01
🤖 AI Summary
Oromo lacks high-quality automatic speech recognition (ASR) data, which hinders ASR research for this low-resource language. To address this, we introduce an open-source Oromo ASR dataset comprising 100 hours of real-world read speech from multiple speakers, capturing phonetic variation and spanning both clean and noisy acoustic conditions. We evaluate Conformer-based models trained with either hybrid CTC-attention loss or pure CTC loss, alongside a fine-tuned Whisper model, supported by speech preprocessing and forced alignment. The Conformer achieves a word error rate (WER) of 15.32% with the hybrid loss and 18.74% with pure CTC, while fine-tuned Whisper reaches the lowest WER of 10.82%. These results establish baselines for Oromo ASR and indicate that fine-tuning a large pretrained model is effective in this low-resource setting.

📝 Abstract
We present a novel Automatic Speech Recognition (ASR) dataset for the Oromo language, which is widely spoken in Ethiopia and neighboring regions. The dataset was collected through a crowd-sourcing initiative and encompasses a diverse range of speakers and phonetic variations. It consists of 100 hours of real-world audio recordings paired with transcriptions, covering read speech in both clean and noisy environments. The dataset addresses the critical need for ASR resources for Oromo, an underrepresented language. To show its applicability to the ASR task, we conducted experiments using the Conformer model, achieving a Word Error Rate (WER) of 15.32% with hybrid CTC and AED loss and a WER of 18.74% with pure CTC loss. Additionally, fine-tuning the Whisper model yielded a significantly lower WER of 10.82%. These results establish baselines for Oromo ASR, highlighting both the challenges and the potential for improving ASR performance in Oromo. The dataset is publicly available at https://github.com/turinaf/sagalee and we encourage its use for further research and development in Oromo speech processing.
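For readers unfamiliar with the metric reported above, WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the recognizer's output, divided by the number of reference words. The sketch below illustrates that standard definition only; it is not the authors' evaluation script, the function name `word_error_rate` is hypothetical, and it assumes whitespace tokenization with no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly;
    any normalization used in the paper is not reproduced here.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

With placeholder tokens, `word_error_rate("a b c d", "a x c")` counts one substitution and one deletion against four reference words. Corpus-level figures such as those in the abstract are usually obtained by summing errors and reference words over all utterances rather than averaging per-utterance rates.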
Problem

Research questions and friction points this paper is trying to address.

Oromo language
Automatic Speech Recognition (ASR)
Data scarcity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Oromo Language
Speech Recognition Database
Conformer and Whisper Models
Turi Abu
Department of Computer Science and Technology, Tsinghua University, Beijing, China
Ying Shi
Syracuse University
Education Policy, Racial Inequality, Labor Economics
Thomas Fang Zheng
Director of Center for Speech and Language Technologies (CSLT), Tsinghua University
Speech Recognition, Speaker Recognition, Natural Language Understanding
Dong Wang
Center for Speech and Language Technologies, BNRist, Beijing