🤖 AI Summary
To address the limited quality and diversity of non-English pretraining corpora for multilingual large language models (LLMs), this work introduces TransWebEdu, a 1.7-trillion-token multilingual corpus built by machine-translating FineWeb-Edu, a high-quality English educational web dataset, into nine languages. On this corpus, the authors pretrain TransWebLLM, a 1.3B-parameter model, from scratch. The study provides systematic empirical evidence that machine-translated multilingual data can match, and in several cases surpass, leading multilingual corpora on key reasoning tasks: across nine non-English reasoning benchmarks, TransWebLLM matches or outperforms state-of-the-art multilingual models trained on closed data, including Llama-3.2, Qwen2.5, and Gemma, despite using an order of magnitude less data. Moreover, adding domain-specific pretraining data amounting to less than 5% of TransWebEdu sets new state-of-the-art results in understanding and commonsense reasoning for Arabic, Italian, Indonesian, Swahili, and Welsh. The corpus, model weights, and training pipeline are publicly released.
📝 Abstract
High-resource languages such as English enable the pretraining of high-quality large language models (LLMs). The same cannot be said for most other languages: LLMs still underperform in non-English languages, likely due to a gap in the quality and diversity of the available multilingual pretraining corpora. In this work, we find that machine-translated text from a single high-quality source language can contribute significantly to the pretraining quality of multilingual LLMs. We translate FineWeb-Edu, a high-quality English web dataset, into nine languages, producing a 1.7-trillion-token dataset we call TransWebEdu, and pretrain a 1.3B-parameter model, TransWebLLM, from scratch on it. Across nine non-English reasoning tasks, we show that TransWebLLM matches or outperforms state-of-the-art multilingual models trained using closed data, such as Llama3.2, Qwen2.5, and Gemma, despite using an order of magnitude less data. We demonstrate that adding less than 5% of TransWebEdu as domain-specific pretraining data sets a new state-of-the-art in Arabic, Italian, Indonesian, Swahili, and Welsh understanding and commonsense reasoning tasks. To promote reproducibility, we release our corpus, models, and training pipeline under Open Source Initiative-approved licenses.