🤖 AI Summary
This work addresses the scarcity of high-quality large language models (LLMs) for Setswana, a low-resource African language. We introduce Pula, the first bilingual (English–Setswana) LLM series (1B–14B parameters) tailored to Setswana. Methodologically, we establish an end-to-end data engineering pipeline: (i) releasing Marothodi, the largest publicly available Setswana corpus to date; (ii) curating Medupi, the first Setswana instruction-tuning dataset; and (iii) releasing two LLM-translated evaluation benchmarks, MMLU-tsn and GSM8K-tsn. Building on the LLaMA architecture, we combine human translation, reformatting of public resources, and controlled synthetic data generation, supported by rigorous filtering and a custom web-crawling toolkit, followed by supervised fine-tuning and instruction tuning. Experiments show that Pula-8B and Pula-14B outperform GPT-4o and Gemini 1.5 Pro on English–Setswana machine translation and on Setswana reasoning tasks, achieving state-of-the-art performance at their scale. All model weights, training logs, source code, and the three datasets are fully open-sourced.
📝 Abstract
In this work we present Pula, a suite of bilingual language models proficient in both Setswana and English. Leveraging recent advancements in data availability and efficient fine-tuning, Pula 8B and Pula 14B outperform GPT-4o and Gemini 1.5 Pro on English-Setswana translation tasks and achieve state-of-the-art performance on Setswana reasoning tasks for their size. We release the weights for Pula 1B, 3B, 8B, and 14B, as well as training logs and the training and evaluation code. Alongside Pula, we release the largest-ever Setswana text corpus, Marothodi, and the first comprehensive Setswana instruction-tuning dataset, Medupi, consisting of reformatted datasets, translated corpora, and synthetic LLM-generated text. To accompany this data, we release the code used for dataset construction, formatting, filtering, and scraping. Finally, we release two Setswana LLM-translated benchmarks, MMLU-tsn and GSM8K-tsn, to measure Setswana knowledge and reasoning capabilities.