Multilingual Language Model Pretraining using Machine-translated Data

📅 2025-02-18
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
To address the limited quality and diversity of pretraining corpora for non-English languages in multilingual large language models (LLMs), this work introduces TransWebEdu, a 1.7-trillion-token multilingual pretraining corpus built by machine-translating FineWeb-Edu, a high-quality English educational web dataset, into nine languages. On this corpus the authors pretrain TransWebLLM, a 1.3B-parameter model, from scratch. The study provides systematic empirical evidence that machine-translated data from a single high-quality source language can match, and in several cases exceed, existing multilingual pretraining corpora on reasoning tasks: across nine non-English reasoning benchmarks, TransWebLLM matches or outperforms state-of-the-art multilingual models trained on closed data, including Llama-3.2, Qwen2.5, and Gemma, despite using an order of magnitude less data. Moreover, adding less than 5% of TransWebEdu as domain-specific pretraining data sets new state-of-the-art results in Arabic, Italian, Indonesian, Swahili, and Welsh understanding and commonsense reasoning tasks. The corpus, model weights, and training pipeline are publicly released under Open Source Initiative-approved licenses.
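
As a rough illustration of the corpus-construction step described above, the sketch below machine-translates streamed FineWeb-Edu documents into one target language with an off-the-shelf NLLB model from Hugging Face. The choice of translation model, language codes, and excerpt truncation are assumptions for illustration only; the paper's actual translation system and settings may differ.

```python
# Rough sketch of the corpus-construction step: machine-translate FineWeb-Edu
# documents into one target language. The MT model, language codes, and
# per-document truncation are illustrative assumptions, not the paper's setup.
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MT_MODEL = "facebook/nllb-200-distilled-600M"  # small stand-in translation model
SRC_LANG = "eng_Latn"
TGT_LANG = "swh_Latn"  # e.g. Swahili

tokenizer = AutoTokenizer.from_pretrained(MT_MODEL, src_lang=SRC_LANG)
model = AutoModelForSeq2SeqLM.from_pretrained(MT_MODEL)

def translate(texts, max_len=512):
    """Translate a batch of English passages into the target language."""
    batch = tokenizer(texts, return_tensors="pt", padding=True,
                      truncation=True, max_length=max_len)
    outputs = model.generate(
        **batch,
        # Force the decoder to generate in the target language.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(TGT_LANG),
        max_length=max_len,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Stream a few FineWeb-Edu documents and translate short excerpts.
stream = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)
for i, doc in enumerate(stream):
    print(translate([doc["text"][:1000]])[0])
    if i >= 2:
        break
```

In practice, documents would be split into sentence- or chunk-sized pieces and translated at scale; the sketch only shows the per-batch translation call.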

📝 Abstract
High-resource languages, such as English, enable the pretraining of high-quality large language models (LLMs). The same cannot be said for most other languages, as LLMs still underperform for non-English languages, likely due to a gap in the quality and diversity of the available multilingual pretraining corpora. In this work, we find that machine-translated texts from a single high-quality source language can contribute significantly to the pretraining quality of multilingual LLMs. We translate FineWeb-Edu, a high-quality English web dataset, into nine languages, resulting in a 1.7-trillion-token dataset, which we call TransWebEdu, and pretrain a 1.3B-parameter model, TransWebLLM, from scratch on it. Across nine non-English reasoning tasks, we show that TransWebLLM matches or outperforms state-of-the-art multilingual models trained using closed data, such as Llama-3.2, Qwen2.5, and Gemma, despite using an order of magnitude less data. We demonstrate that adding less than 5% of TransWebEdu as domain-specific pretraining data sets a new state-of-the-art in Arabic, Italian, Indonesian, Swahili, and Welsh understanding and commonsense reasoning tasks. To promote reproducibility, we release our corpus, models, and training pipeline under Open Source Initiative-approved licenses.
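
The sub-5% domain-specific addition mentioned in the abstract amounts to a pretraining mixture in which a small, targeted slice is blended with the bulk translated corpus. Below is a minimal sketch of how such mixture weights could be expressed with Hugging Face `interleave_datasets`; the dataset choices and the exact proportions are illustrative assumptions, not the paper's actual data configuration.

```python
# Rough sketch of a pretraining mixture with a small domain-specific slice.
# Dataset choices and proportions are illustrative assumptions, not the
# paper's actual data configuration.
from datasets import load_dataset, interleave_datasets

# Bulk corpus (placeholder: the English source; the paper uses its translations).
bulk = load_dataset("HuggingFaceFW/fineweb-edu", split="train",
                    streaming=True).select_columns(["text"])
# Small domain-specific slice (placeholder: Swahili Wikipedia).
domain = load_dataset("wikimedia/wikipedia", "20231101.sw", split="train",
                      streaming=True).select_columns(["text"])

# Sample roughly 96% of examples from the bulk corpus and 4% from the
# domain-specific slice, keeping the latter under the 5% mark.
mixture = interleave_datasets(
    [bulk, domain],
    probabilities=[0.96, 0.04],
    seed=0,
    stopping_strategy="all_exhausted",
)

for i, example in enumerate(mixture):
    # example["text"] would be tokenized and fed to the pretraining loop here.
    if i >= 4:
        break
```
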
Problem

Research questions and friction points this paper is trying to address.

Improving multilingual LLMs with machine-translated data
Addressing performance gaps in non-English languages
Enhancing multilingual reasoning tasks with less data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Machine-translated multilingual pretraining
High-quality English dataset translation
Open-source corpus and model release
Authors
Jiayi Wang
Centre for Artificial Intelligence, University College London
Yao Lu
Centre for Artificial Intelligence, University College London
Maurice Weber
Together AI
Large Language Models · Knowledge Distillation · Machine Learning
Max Ryabinin
Together AI
deep learning · natural language processing · distributed training
D. Adelani
Mila, McGill University, Canada CIFAR AI Chair
Yihong Chen
Centre for Artificial Intelligence, University College London
Raphael Tang
Microsoft
machine learning · natural language processing · multimodality · information retrieval
Pontus Stenetorp
Research and Development Center for Large Language Models, National Institute of Informatics; Centre for Artificial Intelligence, University College London