Sunflower: A New Approach To Expanding Coverage of African Languages in Large Language Models

📅 2025-10-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: African languages remain systematically marginalized in mainstream large language models (LLMs), particularly in highly linguistically diverse countries such as Uganda. Method: We propose a region-centric language support paradigm, in place of fragmented, language-by-language fine-tuning, to enable unified modeling and efficient coverage of the majority of Ugandan languages. Building on Qwen 3, we develop the open-source Sunflower 14B and 32B models, integrating multilingual pretraining, cross-lingual transfer based on linguistic similarity, and open-data augmentation. Results: Empirical evaluation shows state-of-the-art comprehension across low-resource Ugandan languages. The approach can reduce language barriers in practical applications and offers a reusable, systematic methodology for adapting LLMs to African languages.

📝 Abstract

There are more than 2000 living languages in Africa, most of which have been bypassed by advances in language technology. Current leading LLMs exhibit strong performance on a number of the most common languages (e.g. Swahili or Yoruba), but prioritise support for the languages with the most speakers first, resulting in piecemeal ability across disparate languages. We contend that a regionally focussed approach is more efficient, and present a case study for Uganda, a country with high linguistic diversity. We describe the development of Sunflower 14B and 32B, a pair of models based on Qwen 3 with state of the art comprehension in the majority of all Ugandan languages. These models are open source and can be used to reduce language barriers in a number of important practical applications.

Problem

Research questions and friction points this paper is trying to address.

Expanding African language coverage in large language models

Addressing piecemeal language support across disparate African languages

Developing regionally focused models for high linguistic diversity countries

Innovation

Methods, ideas, or system contributions that make the work stand out.

Regionally focused approach for African language coverage

Sunflower 14B and 32B, a pair of models built on Qwen 3

Open source models supporting Ugandan language applications

🔎 Similar Papers

AfroBench: How Good are Large Language Models on African Languages?