🤖 AI Summary
To address the limited quality and diversity of non-English pretraining corpora for multilingual large language models (LLMs), this work introduces TransWebEdu, a 1.7-trillion-token multilingual corpus built by machine-translating FineWeb-Edu, a high-quality English educational web dataset, into nine languages. On this corpus, the authors pretrain TransWebLLM, a 1.3B-parameter model, from scratch. The study provides systematic empirical evidence that machine-translated multilingual data can match, and in several cases surpass, leading multilingual corpora on key reasoning tasks: across nine non-English reasoning benchmarks, TransWebLLM matches or outperforms state-of-the-art multilingual models trained on closed data, including Llama-3.2, Qwen2.5, and Gemma, despite using an order of magnitude less data. Moreover, adding domain-specific pretraining data amounting to less than 5% of TransWebEdu sets new state-of-the-art results in understanding and commonsense reasoning for Arabic, Italian, Indonesian, Swahili, and Welsh. The corpus, model weights, and training pipeline are publicly released.
📝 Abstract
High-resource languages such as English enable the pretraining of high-quality large language models (LLMs). The same cannot be said for most other languages: LLMs still underperform in non-English languages, likely due to a gap in the quality and diversity of the available multilingual pretraining corpora. In this work, we find that machine-translated text from a single high-quality source language can contribute significantly to the pretraining quality of multilingual LLMs. We translate FineWeb-Edu, a high-quality English web dataset, into nine languages, producing a 1.7-trillion-token dataset we call TransWebEdu, and pretrain a 1.3B-parameter model, TransWebLLM, from scratch on it. Across nine non-English reasoning tasks, we show that TransWebLLM matches or outperforms state-of-the-art multilingual models trained using closed data, such as Llama3.2, Qwen2.5, and Gemma, despite using an order of magnitude less data. We demonstrate that adding less than 5% of TransWebEdu as domain-specific pretraining data sets a new state-of-the-art in Arabic, Italian, Indonesian, Swahili, and Welsh understanding and commonsense reasoning tasks. To promote reproducibility, we release our corpus, models, and training pipeline under Open Source Initiative-approved licenses.