AfriqueLLM: How Data Mixing and Model Architecture Impact Continued Pre-training for African Languages

📅 2026-01-10
🏛️ arXiv.org
📈 Citations: 0 (influential: 0)
🤖 AI Summary
This study addresses the gap in higher-order reasoning capabilities, such as mathematical reasoning, between open-source large language models and closed-source systems on African languages, a gap attributed primarily to imbalanced domain coverage in training corpora and a lack of task-relevant knowledge. The authors perform continued pre-training on 26 billion tokens across five base models spanning sizes and architectures, including Llama 3.1, Gemma 3, and Qwen 3, and systematically evaluate data mixing strategies for 20 African languages. Their findings show that task-aligned data influences downstream performance more strongly than the base model's inherent multilingual capacity. Incorporating mathematical, code, and synthetically generated translation data substantially improves both reasoning and translation, with notable gains on document-level, long-context tasks. All resulting models are publicly released.

📝 Abstract
Large language models (LLMs) are increasingly multilingual, yet open models continue to underperform relative to proprietary systems, with the gap most pronounced for African languages. Continued pre-training (CPT) offers a practical route to language adaptation, but improvements on demanding capabilities such as mathematical reasoning often remain limited. This limitation is driven in part by the uneven domain coverage and missing task-relevant knowledge that characterize many low-resource language corpora. We present AfriqueLLM, a suite of open LLMs adapted to 20 African languages through CPT on 26B tokens. We perform a comprehensive empirical study across five base models spanning sizes and architectures, including Llama 3.1, Gemma 3, and Qwen 3, and systematically analyze how CPT data composition shapes downstream performance. In particular, we vary mixtures that include math, code, and synthetic translated data, and evaluate the resulting models on a range of multilingual benchmarks. Our results identify data composition as the primary driver of CPT gains. Adding math, code, and synthetic translated data yields consistent improvements, including on reasoning-oriented evaluations. Within a fixed architecture, larger models typically improve performance, but architectural choices dominate scale when comparing across model families. Moreover, strong multilingual performance in the base model does not reliably predict post-CPT outcomes; robust architectures coupled with task-aligned data provide a more dependable recipe. Finally, our best models improve long-context performance, including document-level translation. Models have been released on [Huggingface](https://huggingface.co/collections/McGill-NLP/afriquellm).
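
The central variable in the abstract is the composition of the CPT data mixture. As a rough illustration only, the sketch below shows one common way such a mixture can be realized during training: sampling each batch's source domain in proportion to a fixed weight. The domain names, weights, and function names are hypothetical assumptions for illustration, not the ratios or implementation used in the paper.

```python
import random

# Hypothetical CPT mixture weights; the actual AfriqueLLM ratios are
# not stated here, so these numbers are placeholder assumptions.
MIXTURE = {
    "african_language_text": 0.55,    # web/monolingual text in the 20 target languages
    "math": 0.15,                     # mathematical reasoning data
    "code": 0.10,                     # source-code data
    "synthetic_translation": 0.20,    # synthetically generated parallel data
}

def sample_domain(weights: dict[str, float], rng: random.Random) -> str:
    """Pick a data domain with probability proportional to its mixture weight."""
    domains = list(weights)
    return rng.choices(domains, weights=[weights[d] for d in domains], k=1)[0]

def build_batch_schedule(weights: dict[str, float], num_batches: int, seed: int = 0) -> list[str]:
    """Assign each training batch a source domain according to the mixture."""
    rng = random.Random(seed)
    return [sample_domain(weights, rng) for _ in range(num_batches)]

if __name__ == "__main__":
    # Over many batches, the empirical domain frequencies approach MIXTURE.
    print(build_batch_schedule(MIXTURE, num_batches=10))
```

In practice the same weights can instead be applied once, offline, by subsampling or upsampling each domain before shuffling; per-batch sampling as above simply makes the mixture explicit at training time.
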
Problem

Research questions and friction points this paper is trying to address.

African languages
continued pre-training
low-resource languages
mathematical reasoning
domain coverage

Innovation

Methods, ideas, or system contributions that make the work stand out.

continued pre-training
data mixing
African languages
model architecture
synthetic translation