🤖 AI Summary
African languages are severely underrepresented in multilingual speech processing, and existing open-source models are typically small-scale, low-performing, and lack a systematic investigation of how model scale and data composition jointly affect low-resource settings. Method: the paper introduces SSA-HuBERT, the first self-supervised HuBERT model family (Large/XL, 317M/964M parameters) trained exclusively on large-scale African speech data. Contribution/Results: through systematic ablation studies, it demonstrates that scaling model capacity specifically for African speech significantly improves downstream automatic speech recognition (ASR) and language identification (LID) performance, revealing a positive synergy between data diversity and model scale. All models are publicly released, establishing a strong foundational backbone and a reproducible benchmark for African-language speech technologies.
📝 Abstract
Despite recent progress in multilingual speech processing, African languages remain under-represented in both research and deployed systems, particularly when it comes to strong, open-weight encoders that transfer well under low-resource supervision. Self-supervised learning has proven especially promising in such settings, yet most publicly released models targeting African speech remain at BASE scale, leaving unanswered whether larger encoders, trained exclusively on Africa-centric audio, offer tangible benefits, and how model capacity interacts with data composition. This work addresses that gap by introducing SSA-HuBERT-Large (317M parameters) and SSA-HuBERT-XL (964M parameters), the first large models trained solely on African speech, alongside a BASE-size counterpart. We release these models as open weights: see https://huggingface.co/collections/Orange/african-speech-foundation-models. By conducting a carefully controlled experimental study focused exclusively on Sub-Saharan languages, covering automatic speech recognition (ASR) and language identification (LID) tasks, we demonstrate that larger architectures significantly improve performance by effectively leveraging large audio datasets.
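The parameter counts quoted above (317M for Large, 964M for XL) are consistent with the standard HuBERT Large and X-Large Transformer configurations. As a rough sanity check, the encoder stack alone can be estimated with a back-of-envelope count; the hyperparameters below are an assumption based on the usual HuBERT/wav2vec 2.0 Large and X-Large recipes, not values stated in the abstract:

```python
def encoder_param_estimate(d_model: int, n_layers: int, d_ffn: int) -> int:
    """Rough parameter count for a Transformer encoder stack
    (self-attention + feed-forward + layer norms, biases included).
    Excludes the convolutional waveform feature extractor and
    projection layers, which add several million more parameters."""
    attn = 4 * d_model * d_model + 4 * d_model   # Q, K, V, output projections
    ffn = 2 * d_model * d_ffn + d_ffn + d_model  # two feed-forward linear layers
    norms = 2 * 2 * d_model                      # two LayerNorms (gain + bias)
    return n_layers * (attn + ffn + norms)

# Assumed hyperparameters (standard HuBERT Large / X-Large configs):
large = encoder_param_estimate(d_model=1024, n_layers=24, d_ffn=4096)
xl = encoder_param_estimate(d_model=1280, n_layers=48, d_ffn=5120)
print(f"Large encoder ~ {large / 1e6:.0f}M")  # ~302M; +conv extractor -> ~317M total
print(f"XL encoder    ~ {xl / 1e6:.0f}M")     # ~945M; +conv extractor -> ~964M total
```

The gap between these estimates and the released totals is accounted for by the convolutional feature extractor and projection layers shared across HuBERT variants.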