AfriHuBERT: A self-supervised speech representation model for African languages

📅 2024-09-30
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
Automatic speech recognition (ASR) and spoken language identification (SLID) for low-resource African languages remain challenging due to data scarcity and linguistic diversity. Method: The authors introduce AfriHuBERT, a self-supervised speech representation model covering 1,226 African languages (spoken by over 600 million people). Built on the compact mHuBERT-147 model, it is obtained through continued pretraining with masked speech modeling on 10K+ hours of multilingual, multi-source African speech data. Contribution/Results: On the FLEURS benchmark, AfriHuBERT achieves a +3.6% absolute F1-score improvement for SLID and a −2.1% absolute reduction in average word error rate (WER) for ASR over mHuBERT-147, while remaining competitive with larger SSL models such as MMS and XEUS despite its smaller size and more focused linguistic scope. ASR models trained on its representations also show improved cross-corpus generalization and hold up in extremely low-resource settings.
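The masked speech modeling objective mentioned above (the HuBERT family's pretraining loss) can be sketched in a few lines: mask spans of frame features, then predict discrete cluster targets for the masked frames with a cross-entropy loss. This is a minimal numpy illustration with toy frames, toy k-means labels, and a single linear projection standing in for the encoder; it is not AfriHuBERT's actual training code.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D, K = 50, 16, 8                      # frames, feature dim, number of cluster targets
frames = rng.normal(size=(T, D))         # toy speech frame features
targets = rng.integers(0, K, size=T)     # toy k-means cluster label per frame

# Span masking: choose a few start frames and mask a fixed-length span from each.
mask = np.zeros(T, dtype=bool)
for start in rng.choice(T, size=5, replace=False):
    mask[start:start + 10] = True

masked = frames.copy()
masked[mask] = 0.0                       # stand-in for a learned mask embedding

# Toy "encoder": one linear projection from features to K cluster logits.
W = rng.normal(size=(D, K))
logits = masked @ W

# Cross-entropy computed only over the masked frames, as in HuBERT-style MSM.
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = -log_probs[mask, targets[mask]].mean()
print(loss)
```

In continued pretraining (the approach used here), the encoder weights start from mHuBERT-147 rather than random initialization, and training proceeds on the new African speech data with the same objective.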

📝 Abstract
In this work, we present AfriHuBERT, an extension of mHuBERT-147, a compact self-supervised learning (SSL) model pretrained on 147 languages. While mHuBERT-147 covered 16 African languages, we expand this to 1,226 through continued pretraining on 10K+ hours of speech data from diverse sources, benefiting an African population of over 600M. We evaluate AfriHuBERT on two key speech tasks, Spoken Language Identification (SLID) and Automatic Speech Recognition (ASR), using the FLEURS benchmark. Our results show a +3.6% F1 score improvement for SLID and a -2.1% average Word Error Rate (WER) reduction for ASR over mHuBERT-147, and demonstrate competitiveness with larger SSL models such as MMS and XEUS. Further analysis shows that ASR models trained on AfriHuBERT exhibit improved cross-corpus generalization and are competitive in extremely low-resource ASR scenarios.
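The ASR metric reported above, Word Error Rate, is the word-level edit distance (substitutions + insertions + deletions) between hypothesis and reference, divided by the number of reference words. A minimal implementation for illustration (not the paper's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[-1][-1] / len(ref)

print(wer("the cat sat", "the cat sat on"))  # one insertion against 3 reference words
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why a −2.1% absolute reduction averaged across many low-resource languages is a meaningful gain.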
Problem

Research questions and friction points this paper is trying to address.

Existing multilingual SSL models cover few African languages (mHuBERT-147 covers only 16)
SLID and ASR for African languages are hampered by data scarcity and linguistic diversity
ASR models often generalize poorly across corpora in low-resource settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extended mHuBERT-147 from 16 to 1,226 African languages via continued pretraining
Used 10K+ hours of diverse, multi-source speech data
Improved SLID (+3.6% F1) and ASR (-2.1% average WER) on FLEURS over mHuBERT-147
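The SLID gains are reported as F1, which for multi-class language identification is typically averaged per language (macro-F1) so that low-resource languages count equally. A toy computation with made-up labels, purely to show the metric:

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: per-label F1 scores, averaged with equal weight per label."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Toy example: "yo"/"ha" are stand-in language codes, not real predictions.
print(macro_f1(["yo", "ha", "yo"], ["yo", "yo", "yo"]))
```

Macro averaging makes the reported +3.6% improvement sensitive to gains on rarer languages rather than being dominated by well-resourced ones.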