Pula: Training Large Language Models for Setswana

📅 2024-08-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the scarcity of high-quality large language models (LLMs) for Setswana, a low-resource African language. We introduce Pula, the first bilingual English–Setswana LLM series (1B–14B parameters). Methodologically, we establish an end-to-end data engineering pipeline: (i) releasing Marothodi, the largest publicly available Setswana corpus to date; (ii) curating Medupi, the first Setswana instruction-tuning dataset; and (iii) releasing two LLM-translated evaluation benchmarks, MMLU-tsn and GSM8K-tsn. Leveraging the LLaMA architecture, we combine human translation, reformatting of public resources, and controlled synthetic data generation, supported by rigorous filtering and a custom web-crawling toolkit, followed by supervised fine-tuning and instruction tuning. Experiments show that Pula-8B and Pula-14B outperform GPT-4o and Gemini 1.5 Pro on English–Setswana machine translation and achieve state-of-the-art performance on Setswana reasoning tasks at comparable scales. All model weights, training logs, source code, and the released datasets are fully open-sourced.

📝 Abstract
In this work we present Pula, a suite of bilingual language models proficient in both Setswana and English. Leveraging recent advancements in data availability and efficient fine-tuning, Pula 8B and Pula 14B outperform GPT-4o and Gemini 1.5 Pro on English-Setswana translation tasks and achieve state-of-the-art performance on Setswana reasoning tasks for their size. We release the weights for Pula 1B, 3B, 8B, and 14B as well as training logs and training and evaluation code. Alongside Pula, we release the largest-ever Setswana text corpus, Marothodi, and the first comprehensive Setswana instruction-tuning dataset, Medupi, consisting of reformatted datasets, translated corpora, and synthetic LLM-generated text. To accompany this data, we release the code used for dataset construction, formatting, filtering, and scraping. Last, we release two Setswana LLM-translated benchmarks, MMLU-tsn and GSM8K-tsn, to measure Setswana knowledge and reasoning capabilities.
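The abstract mentions that the code for dataset construction, formatting, filtering, and scraping is released. As a rough illustration of what one filtering step might look like, here is a minimal sketch of a heuristic Setswana-line filter; the marker-word list, thresholds, and function names are illustrative assumptions, not the authors' actual pipeline.

```python
# Hypothetical sketch of a corpus-filtering step like those described for
# Marothodi. The stopword set and thresholds are illustrative assumptions.

# Common Setswana function words used as a crude language signal.
SETSWANA_MARKERS = {"le", "ya", "go", "ka", "mo", "tsa", "ke", "e", "re", "ba"}

def looks_like_setswana(text: str, min_words: int = 5, min_ratio: float = 0.2) -> bool:
    """Keep a line only if enough of its tokens are common Setswana words."""
    words = text.lower().split()
    if len(words) < min_words:
        return False
    hits = sum(1 for w in words if w in SETSWANA_MARKERS)
    return hits / len(words) >= min_ratio

def filter_corpus(lines: list[str]) -> list[str]:
    """Drop empty, too-short, or likely non-Setswana lines."""
    return [ln.strip() for ln in lines if ln.strip() and looks_like_setswana(ln)]
```

A production pipeline would more plausibly use a trained language-identification model (e.g. fastText-style) plus deduplication, but the keep/drop structure above is the general shape of such a filter.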
Problem

Research questions and friction points this paper is trying to address.

Develop bilingual models for Setswana and English.
Create largest Setswana corpus and instruction dataset.
Release benchmarks for Setswana knowledge and reasoning.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bilingual models for Setswana and English
Largest Setswana corpus and instruction dataset
Setswana benchmarks for knowledge and reasoning evaluation
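Medupi pairs instructions with responses drawn from reformatted, translated, and synthetic sources. A hedged sketch of what one record and its conversion to a user/assistant turn format typical of instruction tuning might look like follows; the field names, provenance tag, and example text are assumptions, not the released schema.

```python
# Illustrative shape of an instruction-tuning record such as those in Medupi.
# Field names and content are hypothetical, not the dataset's actual schema.
record = {
    "instruction": "Fetolela polelo e go Setswana: 'The rain is falling.'",
    "response": "Pula e a na.",
    "source": "human-translated",  # hypothetical provenance tag
}

def to_chat(example: dict) -> list[dict]:
    """Convert a record into a user/assistant turn format for fine-tuning."""
    return [
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["response"]},
    ]
```

Tracking provenance per record (human translation vs. synthetic generation) makes it easy to ablate data sources during supervised fine-tuning.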