A Data-Efficient Path to Multilingual LLMs: Language Expansion via Post-training PARAM$Δ$ Integration into Upcycled MoE

📅 2026-05-18
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the challenge of efficiently extending large language models to new languages. Doing so typically demands substantial pretraining and alignment data, which is often prohibitively expensive, while existing data-free merging approaches struggle to balance retention of original capabilities against acquisition of new linguistic competence. The authors propose PARAMΔ integration, a method that upcycles a dense model into a mixture-of-experts (MoE) architecture, allocates different experts to different languages, and grafts a post-training parameter delta (Δ_post) onto the CPT-enhanced base model, enabling language expansion without additional alignment data. Against baselines with similar FLOPs or parameter counts, the approach improves performance on the newly added languages while preserving performance in the original ones. The authors also report that it applies across different models and post-training deltas, avoiding the trade-off that limits conventional merging strategies.
📝 Abstract
Expanding Large Language Models (LLMs) to new languages is a costly endeavor, demanding extensive Continued Pre-Training (CPT) and data-intensive alignment. While recent data-free merging techniques attempt to bypass alignment by fusing a multilingual CPT-enhanced model with its instruct counterpart, they are plagued by a critical trade-off: mitigating parameter conflicts to preserve original abilities inevitably dilutes new language acquisition, and vice versa. To resolve this conflict, we introduce a method that upcycles a dense model into a Mixture-of-Experts (MoE) architecture, allocating different experts to different languages. Alignment ability is then transferred by grafting an MoE-expanded parameter delta ($Δ_{\text{post}}$) to the CPT-enhanced base model, bypassing the complex alignment phase. Experiments demonstrate the method's superiority even against baselines with similar FLOPs or number of parameters; it improves performance on expanded languages while effectively preserving original capabilities. We further show that our approach is highly applicable across different models and post-training deltas.
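The grafting step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes that $Δ_{\text{post}}$ is the element-wise difference between the instruct and base weights, and that "MoE-expanded" means copying the feed-forward delta once per expert, neither of which is spelled out in the abstract. All function and parameter names (`post_training_delta`, `expand_delta_to_experts`, `graft`, `ffn`, `router`) are hypothetical, and plain Python lists stand in for weight tensors.

```python
from typing import Dict, List

Params = Dict[str, List[float]]  # toy stand-in for a model state dict


def post_training_delta(instruct: Params, base: Params) -> Params:
    # Assumption: delta_post = theta_instruct - theta_base, element-wise.
    return {k: [i - b for i, b in zip(instruct[k], base[k])] for k in base}


def expand_delta_to_experts(delta: Params, ffn_keys: List[str],
                            num_experts: int) -> Params:
    # Assumption: the dense FFN delta is replicated for every expert;
    # non-FFN (shared) parameters keep a single copy.
    expanded: Params = {}
    for name, values in delta.items():
        if name in ffn_keys:
            for e in range(num_experts):
                expanded[f"{name}.expert{e}"] = list(values)
        else:
            expanded[name] = list(values)
    return expanded


def graft(moe_cpt: Params, moe_delta: Params) -> Params:
    # Add the expanded delta to the CPT-enhanced MoE weights. Parameters
    # without a delta (e.g. a hypothetical router) are left unchanged.
    merged: Params = {}
    for name, weights in moe_cpt.items():
        d = moe_delta.get(name)
        merged[name] = (list(weights) if d is None
                        else [w + x for w, x in zip(weights, d)])
    return merged


# Toy dense base and instruct checkpoints.
base = {"attn": [1.0, 2.0], "ffn": [0.5, -0.5]}
instruct = {"attn": [1.5, 2.0], "ffn": [1.0, -1.0]}

# Toy CPT-enhanced MoE with two experts (values are made up).
moe_cpt = {
    "attn": [1.0, 2.0],
    "ffn.expert0": [0.5, -0.5],
    "ffn.expert1": [0.75, -0.25],
    "router": [0.0, 0.0],
}

delta = post_training_delta(instruct, base)
moe_delta = expand_delta_to_experts(delta, ffn_keys=["ffn"], num_experts=2)
merged = graft(moe_cpt, moe_delta)
```

In this sketch each expert receives the same alignment delta while keeping its own CPT-trained weights, which is one way to read the claim that alignment is transferred without an additional alignment phase.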
Problem

Research questions and friction points this paper is trying to address.

Multilingual LLMs
Language Expansion
Parameter Conflicts
Data Efficiency
Model Alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
Parameter Delta
Post-training
Language Expansion
Data-Efficient