Sunflower: A New Approach To Expanding Coverage of African Languages in Large Language Models

📅 2025-10-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
African languages remain systematically marginalized in mainstream large language models (LLMs), particularly in linguistically diverse countries such as Uganda. Method: The authors propose a region-centric language-support paradigm, departing from fragmented, language-by-language fine-tuning, to enable unified modeling and efficient coverage of the vast majority of indigenous Ugandan languages. Building on the Qwen 3 architecture, they develop the open-source Sunflower 14B and 32B models, combining multilingual pretraining, cross-lingual transfer that exploits linguistic similarity, and open-data augmentation. Results: Empirical evaluation demonstrates substantial improvements in comprehension across low-resource Ugandan languages, achieving state-of-the-art performance. The approach reduces language barriers in important practical applications and establishes a reusable, systematic methodology for adapting AI to African languages.

📝 Abstract
There are more than 2000 living languages in Africa, most of which have been bypassed by advances in language technology. Current leading LLMs exhibit strong performance on a number of the most common languages (e.g. Swahili or Yoruba), but prioritise support for the languages with the most speakers first, resulting in piecemeal ability across disparate languages. We contend that a regionally focused approach is more efficient, and present a case study for Uganda, a country with high linguistic diversity. We describe the development of Sunflower 14B and 32B, a pair of models based on Qwen 3 with state-of-the-art comprehension in the majority of all Ugandan languages. These models are open source and can be used to reduce language barriers in a number of important practical applications.
Problem

Research questions and friction points this paper is trying to address.

Expanding African language coverage in large language models
Addressing piecemeal language support across disparate African languages
Developing regionally focused models for high linguistic diversity countries
Innovation

Methods, ideas, or system contributions that make the work stand out.

Regionally focused approach for African language coverage
Models based on Qwen 3 architecture with enhancements
Open source models supporting Ugandan language applications
Benjamin Akera
Sunbird AI, Uganda
Evelyn Nafula Ouma
Sunbird AI, Uganda
Gilbert Yiga
Sunbird AI, Uganda
Patrick Walukagga
Sunbird AI, Uganda
Phionah Natukunda
Sunbird AI, Uganda
Trevor Saaka
Sunbird AI, Uganda
Solomon Nsumba
Sunbird AI, Uganda
Lilian Teddy Nabukeera
Sunbird AI, Uganda
Joel Muhanguzi
Sunbird AI, Uganda
Imran Sekalala
Sunbird AI, Uganda
Nimpamya Janat Namara
Sunbird AI, Uganda
Engineer Bainomugisha
Professor of Computer Science, Makerere University, Kampala
Programming Languages, Distributed Systems, Reactive Programming, Cloud, AI and Machine Learning
Ernest Mwebaze
Sunbird AI, Uganda
John Quinn
Google Research, Sunbird AI, Makerere University