The State of Large Language Models for African Languages: Progress and Challenges

📅 2025-06-02
🤖 AI Summary
This study systematically evaluates how well state-of-the-art language models support approximately 2,000 low-resource African languages, and it finds severe coverage gaps: only 42 languages are supported (<2% coverage), and only the Latin, Arabic, and Ge'ez scripts are recognized, while 20 other active scripts remain entirely unsupported. Methodologically, the study audits six large language models (LLMs), eight small language models, and six specialized small language models, comparing them on language coverage, training data availability, script compatibility, and technical bottlenecks, and diagnosing the root causes of the gaps. Key contributions include: (1) identifying Amharic, Swahili, Afrikaans, and Malagasy as the only four consistently supported African languages; (2) cataloging 23 publicly available datasets; and (3) diagnosing four fundamental challenges (data scarcity, tokenization bias, prohibitive computational costs, and the absence of standardized evaluation frameworks). Together these provide a baseline and a roadmap for improving AI fairness for African languages.

📝 Abstract
Large Language Models (LLMs) are transforming Natural Language Processing (NLP), but their benefits are largely absent for Africa's 2,000 low-resource languages. This paper comparatively analyzes African language coverage across six LLMs, eight Small Language Models (SLMs), and six Specialized SLMs (SSLMs). The evaluation covers language coverage, training sets, technical limitations, script problems, and language modelling roadmaps. The work identifies 42 supported African languages and 23 publicly available datasets, and it reveals a large gap: four languages (Amharic, Swahili, Afrikaans, and Malagasy) are consistently supported, while over 98% of African languages remain unsupported. Moreover, the review shows that only the Latin, Arabic, and Ge'ez scripts are recognized, while 20 active scripts are neglected. The primary challenges include data scarcity, tokenization biases, very high computational costs, and evaluation issues. Addressing them requires language standardization, community-driven corpus development, and effective adaptation methods for African languages.

Problem

Research questions and friction points this paper is trying to address.

Evaluates African language coverage in LLMs, SLMs, and SSLMs
Identifies lack of support for 98% of African languages
Highlights challenges like data scarcity and script neglect

Innovation

Methods, ideas, or system contributions that make the work stand out.

Comparative analysis of African language coverage
Identifies 42 supported African languages
Highlights lack of data and tokenization biases
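
One common contributor to the tokenization bias mentioned above can be illustrated with a minimal sketch. It assumes a byte-level tokenizer that starts from UTF-8 bytes before any merges are learned; the paper itself may describe the bias differently. The sample words and the `bytes_per_char` helper are hypothetical and chosen only for illustration. The sketch compares how many UTF-8 bytes each character costs in the three scripts the paper reports as recognized.

```python
# Hypothetical sample words, one per script; any text in each script would do.
SAMPLES = {
    "Latin (Swahili)": "habari",
    "Arabic": "سلام",
    "Ge'ez (Amharic)": "ሰላም",
}


def bytes_per_char(text: str) -> float:
    """Return the average number of UTF-8 bytes per Unicode code point."""
    return len(text.encode("utf-8")) / len(text)


# Assumption: a byte-level tokenizer sees these bytes as its initial units,
# so a higher value means more base units for the same number of characters.
for label, word in SAMPLES.items():
    print(f"{label}: {bytes_per_char(word):.1f} bytes per character")
```

Under this assumption, text in the Ge'ez script starts with more base units per character than Latin-script text. Unless the training data contains enough of that script for the tokenizer to learn compensating merges, the same content consumes more tokens.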