🤖 AI Summary
African languages remain systematically marginalized in mainstream large language models (LLMs), particularly in highly linguistically diverse countries such as Uganda. Method: We propose a region-centric language support paradigm, departing from fragmented, language-by-language fine-tuning, to enable unified modeling and efficient coverage of the majority of indigenous Ugandan languages. Building on the Qwen3 architecture, we develop Sunflower-14B and Sunflower-32B, a pair of open-source models that integrate multilingual pretraining, cross-lingual transfer exploiting linguistic similarity, and open-data augmentation. Results: Empirical evaluation demonstrates state-of-the-art comprehension across low-resource Ugandan languages. The approach mitigates language barriers in critical applications and establishes a reusable, systematic methodology for adapting AI to African languages.
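To make the contrast with language-by-language fine-tuning concrete, the sketch below pools sentences from several related Ugandan languages into a single training mixture for one model. This is a minimal illustration of the region-centric idea only; the language codes, example sentences, and sampling weights are placeholders, not the paper's actual data pipeline.

```python
# Region-centric sketch: one mixed corpus for joint fine-tuning,
# instead of a separate fine-tune per language.
import random

corpora = {
    "lug": ["Oli otya?", "Webale nnyo."],      # Luganda (placeholder sentences)
    "ach": ["Itye nining?", "Apwoyo matek."],  # Acholi (placeholder sentences)
    "nyn": ["Agandi?", "Webare munonga."],     # Runyankole (placeholder sentences)
}

# Upweight lower-resource languages so the mixture is not dominated by the
# largest corpus; the weights here are arbitrary assumptions.
weights = {"lug": 1.0, "ach": 2.0, "nyn": 1.5}

def sample_batch(n: int = 8) -> list[tuple[str, str]]:
    """Draw a language-mixed batch for unified fine-tuning of a single model."""
    langs = random.choices(list(corpora), weights=[weights[l] for l in corpora], k=n)
    return [(lang, random.choice(corpora[lang])) for lang in langs]

print(sample_batch())
```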
📝 Abstract
There are more than 2,000 living languages in Africa, most of which have been bypassed by advances in language technology. Current leading LLMs perform strongly on a few of the most widely spoken languages (e.g. Swahili or Yoruba), but because support is prioritised by speaker numbers, capability across the remaining languages is piecemeal. We contend that a regionally focussed approach is more efficient, and present a case study for Uganda, a country with high linguistic diversity. We describe the development of Sunflower 14B and 32B, a pair of models based on Qwen 3 with state-of-the-art comprehension in the majority of Ugandan languages. These models are open source and can be used to reduce language barriers in a number of important practical applications.
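Since the models are open source and based on Qwen 3, a minimal inference sketch using the standard Hugging Face `transformers` chat interface is shown below. The repository ID `Sunbird/Sunflower-14B` and the example prompt are assumptions; check the actual model card for the published path and recommended settings.

```python
# Minimal inference sketch for a Qwen 3 derivative; repo ID is an assumption.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Sunbird/Sunflower-14B"  # hypothetical Hub path
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto"
)

messages = [
    {"role": "user", "content": "Translate to Luganda: Where is the nearest clinic?"}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```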