MzansiText and MzansiLM: An Open Corpus and Decoder-Only Language Model for South African Languages

📅 2026-03-21
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
This work addresses the scarcity of open-source, small-scale decoder-only language models and high-quality multilingual pretraining corpora for South Africa's eleven official written languages, nine of which are low-resource. To bridge this gap, the authors construct MzansiText, a multilingual corpus built with a reproducible filtering pipeline, and train MzansiLM, a 125-million-parameter decoder-only language model. Through monolingual and multilingual task-specific finetuning as well as multi-task instruction tuning, the model adapts effectively to low-resource languages: it reaches a BLEU score of 20.65 on isiXhosa data-to-text generation, competing with encoder-decoder baselines over ten times its size, and attains 78.5% macro-F1 on isiXhosa news topic classification. The release comprises the first open-source decoder-only language model and corpus covering all officially recognized written languages of South Africa, advancing NLP research for low-resource settings.
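The page does not give MzansiLM's architecture beyond the 125M parameter count, but that scale matches the familiar GPT-2-small shape (12 layers, 12 attention heads, 768-dimensional embeddings). Below is a minimal sketch of such a configuration using Hugging Face transformers; every hyperparameter, including the placeholder vocabulary size, is an illustrative assumption rather than MzansiLM's published setting.

    # Sketch of a ~125M-parameter decoder-only model in the GPT-2-small
    # shape. All values are illustrative assumptions, not MzansiLM's
    # published configuration.
    from transformers import GPT2Config, GPT2LMHeadModel

    config = GPT2Config(
        vocab_size=50_257,  # placeholder; MzansiLM's tokenizer size is not stated here
        n_positions=1024,   # context length (assumed)
        n_embd=768,         # hidden size
        n_layer=12,         # transformer blocks
        n_head=12,          # attention heads
    )
    model = GPT2LMHeadModel(config)
    print(f"parameters: {model.num_parameters() / 1e6:.1f}M")  # ~124M at this shape

Instantiating a config like this is a quick way to sanity-check that a proposed width/depth combination actually lands near the reported parameter budget.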

📝 Abstract
Decoder-only language models can be adapted to diverse tasks through instruction finetuning, but the extent to which this generalizes at small scale for low-resource languages remains unclear. We focus on the languages of South Africa, where we are not aware of a publicly available decoder-only model that explicitly targets all eleven official written languages, nine of which are low-resource. We introduce MzansiText, a curated multilingual pretraining corpus with a reproducible filtering pipeline, and MzansiLM, a 125M-parameter language model trained from scratch. We evaluate MzansiLM on natural language understanding and generation using three adaptation regimes: monolingual task-specific finetuning, multilingual task-specific finetuning, and general multi-task instruction finetuning. Monolingual task-specific finetuning achieves strong performance on data-to-text generation, reaching 20.65 BLEU on isiXhosa and competing with encoder-decoder baselines over ten times larger. Multilingual task-specific finetuning benefits closely related languages on topic classification, achieving 78.5% macro-F1 on isiXhosa news classification. While MzansiLM adapts effectively to supervised NLU and NLG tasks, few-shot reasoning remains challenging at this model size, with performance near chance even for much larger decoder-only models. We release MzansiText and MzansiLM to provide a reproducible decoder-only baseline and clear guidance on adaptation strategies for South African languages at small scale.
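The headline numbers in the abstract are standard metrics and can be reproduced with off-the-shelf tooling. A minimal sketch, assuming sacrebleu for corpus BLEU on data-to-text outputs and scikit-learn for macro-F1 on topic classification; the strings and labels below are toy placeholders, not the paper's data.

    # Sketch of the two evaluation metrics cited in the abstract:
    # corpus BLEU for data-to-text generation, macro-F1 for news
    # topic classification. Inputs are toy placeholders.
    import sacrebleu
    from sklearn.metrics import f1_score

    # Corpus BLEU: model outputs vs. parallel reference stream(s)
    hypotheses = ["molo lizwe"]    # model outputs (placeholder)
    references = [["molo lizwe"]]  # one list of references per stream
    bleu = sacrebleu.corpus_bleu(hypotheses, references)
    print(f"BLEU: {bleu.score:.2f}")

    # Macro-F1 averages per-class F1, so minority topics weigh equally
    y_true = ["sports", "politics", "sports", "business"]
    y_pred = ["sports", "politics", "business", "business"]
    print(f"macro-F1: {f1_score(y_true, y_pred, average='macro'):.3f}")

Macro-averaging is the natural reporting choice here because it rewards balanced performance across topics rather than accuracy on the most frequent class, which matters for imbalanced low-resource news datasets.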
Problem

Research questions and friction points this paper is trying to address.

low-resource languages · decoder-only language model · South African languages · instruction finetuning · multilingual NLP
Innovation

Methods, ideas, or system contributions that make the work stand out.

low-resource languages · decoder-only language model · multilingual pretraining corpus · instruction finetuning · South African languages
👥 Authors

Anri Lombard, University of Cape Town
Simbarashe Mawere, Student, University of Cape Town (natural language processing, computational linguistics, tokenisation)
Temi Aina, University of Cape Town
Ethan Wolff, University of Cape Town
Sbonelo Gumede, University of Cape Town
Elan Novick, University of Cape Town
Francois Meyer, PhD student, University of Cape Town (Natural Language Processing, Machine Learning)
Jan Buys, University of Cape Town (Natural Language Processing, Machine Learning)