🤖 AI Summary
This work addresses a gap in existing speech recognition benchmarks: they do not evaluate African languages under real-world noise conditions or in fine-grained, domain-specific settings, so they reveal little about how models generalize in low-resource scenarios. We introduce the first benchmark tailored to authentic African use cases, comprising unscripted field recordings categorized into ten vertical domains, including government, finance, and healthcare, with a focused assessment of digit and named entity recognition. Through systematic evaluation of state-of-the-art models such as Sahara-v2, Gemini 3 Flash, and Omnilingual CTC, we find substantial performance degradation in the presence of complex acoustic noise and domain-specific terminology. The benchmark establishes a realistic evaluation framework for African language speech recognition and provides a foundation for developing localized speech AI systems.
📝 Abstract
Recent large language models (LLMs) show strong speech recognition and translation capabilities for high-resource languages. However, African languages remain severely underrepresented in benchmarks, which limits the practical use of these models in low-resource settings. While early benchmarks tested African languages and accents, they lacked thorough coverage of real-world noise and granular, domain-level evaluation. We present AfriVox-v2, a comprehensive benchmark designed to test speech models under realistic African deployment conditions. AfriVox-v2 introduces "in the wild" unscripted audio for all supported languages. We also introduce strict domain verticalization, evaluating model accuracy across ten sectors, including government, finance, health, and agriculture, and conducting targeted tests on numbers and named entities. Finally, we benchmark a new generation of speech models, including Sahara-v2, Gemini 3 Flash, and the Omnilingual CTC models. Our results expose the generalization gap of modern speech models in specialized, noisy African contexts and provide a reliable blueprint for developers building localized voice AI.