Revisiting Multilingual Data Mixtures in Language Model Pretraining

📅 2025-10-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
It remains unclear whether scaling the number of languages in multilingual pretraining inevitably incurs a "curse of multilinguality", i.e., that adding more languages degrades per-language performance as fixed model capacity is spread over wider coverage. Method: We systematically train and evaluate 1.1B- and 3B-parameter models on diverse multilingual corpora, varying the number of languages (from 25 up to 400) and their data proportions while controlling for confounding factors. Cross-lingual transfer is assessed via zero-shot and few-shot probing across typologically diverse languages. Contribution/Results: We find that English, when used as a pivot (high-resource hub) language, substantially enhances cross-family transfer without degrading low-resource language performance. Crucially, scaling to 400 languages induces no significant performance degradation for either high- or low-resource languages at these model scales. These results indicate that carefully designed multilingual data mixing can improve both high- and low-resource language capabilities, challenging the assumed trade-off between multilingual coverage and model performance.
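The zero-shot probing mentioned above is not detailed on this page; a common way to realize it is to score each candidate answer by the log-likelihood a pretrained causal LM assigns to it and pick the highest-scoring option. Below is a minimal sketch along those lines using the Hugging Face transformers interface; the checkpoint name and the Swahili example item are placeholders for illustration, not artifacts from the paper.

```python
# Hedged sketch: zero-shot multiple-choice probing by answer log-likelihood.
# The model name and example prompt are hypothetical, not released by the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "my-org/multilingual-lm-1b"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def option_logprob(prompt: str, option: str) -> float:
    """Total log-probability the model assigns to `option` as a continuation of `prompt`."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits               # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]                         # logits at position t predict token t+1
    idx = torch.arange(prompt_len - 1, targets.shape[0])  # option tokens only (approximate split)
    return log_probs[idx, targets[idx]].sum().item()

def zero_shot_choice(prompt: str, options: list[str]) -> str:
    """Pick the candidate with the highest log-likelihood; no in-language fine-tuning."""
    return max(options, key=lambda o: option_logprob(prompt, o))

# Illustrative Swahili item ("The sun rises in the ..."), not from the paper's benchmarks.
print(zero_shot_choice("Jua huchomoza upande wa ", ["mashariki.", "magharibi."]))
```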

📝 Abstract
The impact of different multilingual data mixtures in pretraining large language models (LLMs) has been a topic of ongoing debate, often raising concerns about potential trade-offs between language coverage and model performance (i.e., the curse of multilinguality). In this work, we investigate these assumptions by training 1.1B and 3B parameter LLMs on diverse multilingual corpora, varying the number of languages from 25 to 400. Our study challenges common beliefs surrounding multilingual training. First, we find that combining English and multilingual data does not necessarily degrade the in-language performance of either group, provided that languages have a sufficient number of tokens included in the pretraining corpus. Second, we observe that using English as a pivot language (i.e., a high-resource language that serves as a catalyst for multilingual generalization) yields benefits across language families, and contrary to expectations, selecting a pivot language from within a specific family does not consistently improve performance for languages within that family. Lastly, we do not observe a significant "curse of multilinguality" as the number of training languages increases in models at this scale. Our findings suggest that multilingual data, when balanced appropriately, can enhance language model capabilities without compromising performance, even in low-resource settings.
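The abstract does not spell out how the data proportions are balanced; a common recipe in multilingual pretraining is to flatten the natural per-language token shares with a sampling temperature so that low-resource languages are up-sampled. The sketch below illustrates that recipe under those assumptions; the token counts and the temperature value are illustrative, not figures from the paper.

```python
# Hedged sketch: temperature-based re-balancing of per-language token shares,
# a common multilingual mixing recipe (not necessarily the paper's exact one).

def mixture_weights(token_counts: dict[str, int], temperature: float = 3.0) -> dict[str, float]:
    """Sampling probability per language: p_l proportional to (n_l / N) ** (1 / temperature).

    temperature = 1 reproduces the natural distribution; larger values flatten it,
    giving low-resource languages a larger share of the pretraining mixture.
    """
    total = sum(token_counts.values())
    scaled = {lang: (n / total) ** (1.0 / temperature) for lang, n in token_counts.items()}
    norm = sum(scaled.values())
    return {lang: w / norm for lang, w in scaled.items()}

# Illustrative per-language token counts (arbitrary units), not numbers from the paper.
counts = {"en": 500_000, "hi": 20_000, "sw": 2_000, "yo": 200}
for lang, p in mixture_weights(counts).items():
    print(f"{lang}: {p:.3f}")
```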
Problem

Research questions and friction points this paper is trying to address.

Investigates multilingual data mixtures' impact on language model pretraining performance
Challenges assumptions about trade-offs between language coverage and model capabilities
Examines whether increasing the number of training languages causes performance degradation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Balancing multilingual data mixtures without performance degradation
Using English as a pivot language benefits cross-lingual generalization across language families
No significant curse of multilinguality observed at the 1.1B-3B parameter scales studied