Emergent Abilities of Large Language Models under Continued Pretraining for Language Adaptation

📅 2025-05-30
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This study investigates the impact of including English data in continued pretraining (CPT) when adapting multilingual large language models to new target languages. It finds that while mixing in English data does not change validation perplexity, it is critical for in-context learning (ICL) and generalization in the target language; monolingual CPT, by contrast, often triggers catastrophic forgetting. Method: To diagnose this phenomenon, the authors introduce a language-agnostic ICL evaluation benchmark and propose an English-free CPT recipe that combines curriculum learning with an exponential moving average (EMA) of model weights. Contribution/Results: Experiments show that the proposed method mitigates forgetting, stabilizes downstream task performance, and suppresses parameter drift, offering a reproducible, lightweight path for efficiently adapting multilingual LLMs to low-resource languages.

πŸ“ Abstract
Continued pretraining (CPT) is a popular approach to adapt existing large language models (LLMs) to new languages. When doing so, it is common practice to include a portion of English data in the mixture, but its role has not been carefully studied to date. In this work, we show that including English does not impact validation perplexity, yet it is critical for the emergence of downstream capabilities in the target language. We introduce a language-agnostic benchmark for in-context learning (ICL), which reveals catastrophic forgetting early on CPT when English is not included. This in turn damages the ability of the model to generalize to downstream prompts in the target language as measured by perplexity, even if it does not manifest in terms of accuracy until later in training, and can be tied to a big shift in the model parameters. Based on these insights, we introduce curriculum learning and exponential moving average (EMA) of weights as effective alternatives to mitigate the need for English. All in all, our work sheds light into the dynamics by which emergent abilities arise when doing CPT for language adaptation, and can serve as a foundation to design more effective methods in the future.
Problem

Research questions and friction points this paper is trying to address.

Role of English data in multilingual LLM adaptation
Catastrophic forgetting in continued pretraining without English
Mitigating English dependency via curriculum learning and EMA
Innovation

Methods, ideas, or system contributions that make the work stand out.

Continued pretraining adapts LLMs to new languages
English data inclusion critical for downstream capabilities
Curriculum learning and EMA mitigate English dependency
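The weight-EMA idea listed above can be sketched in a few lines: during CPT, a shadow copy of the parameters is smoothed toward a running average, which damps the parameter drift associated with forgetting. This is a minimal illustrative sketch, not the paper's implementation; the class name, plain-dict parameters, and decay value are assumptions.

```python
# Minimal sketch of an exponential moving average (EMA) over model weights,
# as used to stabilize continued pretraining. Parameters are modeled as a
# plain dict of scalars; a real setup would operate on tensors.
class WeightEMA:
    def __init__(self, params, decay=0.999):
        self.decay = decay
        # Shadow copy holds the smoothed weights.
        self.shadow = {name: float(value) for name, value in params.items()}

    def update(self, params):
        # shadow <- decay * shadow + (1 - decay) * current
        for name, value in params.items():
            self.shadow[name] = (
                self.decay * self.shadow[name] + (1.0 - self.decay) * float(value)
            )

    def copy_to(self, params):
        # Overwrite live parameters with the smoothed ones (e.g., for eval).
        for name in params:
            params[name] = self.shadow[name]


# Toy usage: one scalar "parameter" drifting during training.
params = {"w": 0.0}
ema = WeightEMA(params, decay=0.9)
for step in range(5):
    params["w"] += 1.0  # stand-in for an optimizer update
    ema.update(params)
print(round(ema.shadow["w"], 4))  # smoothed weight lags the raw value of 5.0
```

With a high decay the shadow weights change slowly, so a sudden shift in the live parameters (as observed early in English-free CPT) is attenuated in the averaged model used for evaluation.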