Grow Up and Merge: Scaling Strategies for Efficient Language Adaptation

📅 2025-12-11
🤖 AI Summary
This paper addresses the challenge of adapting large language models to mid- and low-resource languages. Instead of relying on conventional continued pretraining alone, it proposes an efficient multilingual expansion paradigm based on model upscaling. Its three core contributions are: (1) a systematic empirical validation, via approximately FLOP-matched ablations, that upscaling a pretrained English base model enables more effective and resource-efficient target-language adaptation than continued pretraining of a smaller model; (2) evidence that upscaled models, once exposed to sufficient target-language data, can match or surpass smaller models continually pretrained on far more data, while better preserving source-language (e.g., English) capabilities and thus mitigating catastrophic forgetting; and (3) an investigation of language-level parameter merging (e.g., PMT, TIES-Merging), showing that merging upscaled models yields better multilingual performance and consistency than merging smaller base models, although merging still trails joint multilingual training. Experiments include FLOP-matched scaling ablations, comprehensive multilingual capability evaluation, and forgetting analysis.

📝 Abstract
Achieving high-performing language models that cover medium- and lower-resource languages remains a challenge. Massively multilingual models still underperform compared to language-specific adaptations, especially at smaller model scales. In this work, we investigate scaling as an efficient strategy for adapting pretrained models to new target languages. Through comprehensive scaling ablations with approximately FLOP-matched models, we test whether upscaling an English base model enables more effective and resource-efficient adaptation than standard continued pretraining. We find that, once exposed to sufficient target-language data, larger upscaled models can match or surpass the performance of smaller models continually pretrained on much more data, demonstrating the benefits of scaling for data efficiency. Scaling also helps preserve the base model's capabilities in English, thus reducing catastrophic forgetting. Finally, we explore whether such scaled, language-specific models can be merged to construct modular and flexible multilingual systems. We find that while merging remains less effective than joint multilingual training, upscaled merges perform better than smaller ones. We observe large performance differences across merging methods, suggesting potential for improvement through merging approaches specialized for language-level integration.
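The "approximately FLOP-matched" comparison above can be made concrete with the standard ~6·N·D approximation for dense-transformer training compute (N = parameter count, D = training tokens). A minimal sketch, with illustrative numbers not taken from the paper:

```python
# Sketch: matching total training compute between a small base model and
# an upscaled model using the common ~6 * N * D FLOPs approximation.
# All parameter/token counts below are illustrative assumptions.

def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training FLOPs for a dense transformer."""
    return 6.0 * n_params * n_tokens

def matched_tokens(n_params_small: float, n_tokens_small: float,
                   n_params_large: float) -> float:
    """Token budget for the larger model at the same total FLOPs."""
    return train_flops(n_params_small, n_tokens_small) / (6.0 * n_params_large)

# e.g., a 1B-parameter model trained on 100B tokens vs. a 2B upscaled model:
tokens_large = matched_tokens(1e9, 100e9, 2e9)  # half the token budget
```

Under this approximation, doubling the parameter count halves the token budget at equal compute, which is why data efficiency of the larger model is the quantity being tested.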
Problem

Research questions and friction points this paper is trying to address.

Scaling pretrained models for efficient adaptation to new languages
Reducing catastrophic forgetting while preserving base language capabilities
Exploring merging strategies for modular multilingual system construction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Upscaling base models for efficient language adaptation
Scaling reduces catastrophic forgetting of source-language (English) capabilities
Merging scaled models for modular multilingual systems