🤖 AI Summary
This work addresses the challenge of efficiently extending large language models to new languages. Doing so typically demands substantial pretraining and alignment data, which is often prohibitively expensive, while existing data-free merging approaches struggle to balance retention of original capabilities with acquisition of new linguistic competence. The authors propose PARAMΔ integration, a method that upcycles a dense model into a mixture-of-experts (MoE) architecture, assigns different experts to different languages, and grafts a post-training parameter delta (Δ_post) onto the continued-pretraining-enhanced base model, enabling language expansion without additional alignment data. Against baselines with similar computational or parameter budgets, the approach performs better, preserving performance in the original languages while improving it in the newly added ones. The technique applies across different models and post-training deltas, avoiding the trade-off that limits conventional merging strategies.
📝 Abstract
Expanding Large Language Models (LLMs) to new languages is a costly endeavor, demanding extensive Continued Pre-Training (CPT) and data-intensive alignment. While recent data-free merging techniques attempt to bypass alignment by fusing a multilingual CPT-enhanced model with its instruct counterpart, they are plagued by a critical trade-off: mitigating parameter conflicts to preserve original abilities inevitably dilutes new language acquisition, and vice versa. To resolve this conflict, we introduce \method, which upcycles a dense model into a Mixture-of-Experts (MoE) architecture, allocating different experts to different languages. Alignment ability is then transferred by grafting an MoE-expanded parameter delta ($Δ_{\text{post}}$) onto the CPT-enhanced base model, bypassing the complex alignment phase. Experiments demonstrate \method's superiority even against baselines with similar FLOPs or numbers of parameters: it improves performance on expanded languages while effectively preserving original capabilities. We further show that our approach is highly applicable across different models and post-training deltas.
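
The grafting step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the post-training delta is the element-wise difference between the instruct and base weights, and that "MoE-expanded" means the dense delta is copied once per expert. The abstract states neither detail. The names `post_training_delta`, `expand_delta`, and `graft`, the parameter name `ffn.w`, and the toy weights are all hypothetical.

```python
from typing import Dict, List

Vector = List[float]
Dense = Dict[str, Vector]        # parameter name -> flat weights of a dense model
MoE = Dict[str, List[Vector]]    # parameter name -> one weight vector per expert


def post_training_delta(instruct: Dense, base: Dense) -> Dense:
    """Assumed definition: delta_post = instruct - base, per parameter."""
    return {
        name: [wi - wb for wi, wb in zip(instruct[name], base[name])]
        for name in base
    }


def expand_delta(delta: Dense, num_experts: int) -> MoE:
    """Assumption: the dense delta is replicated once for every expert."""
    return {
        name: [list(vec) for _ in range(num_experts)]
        for name, vec in delta.items()
    }


def graft(cpt_moe: MoE, delta_moe: MoE) -> MoE:
    """Add the expanded delta to each expert of the CPT-enhanced MoE base."""
    return {
        name: [
            [w + d for w, d in zip(expert, delta_expert)]
            for expert, delta_expert in zip(cpt_moe[name], delta_moe[name])
        ]
        for name in cpt_moe
    }


# Toy weights: one dense FFN parameter and a two-expert MoE after CPT.
base = {"ffn.w": [1.0, 2.0]}
instruct = {"ffn.w": [1.5, 1.0]}
cpt_moe = {"ffn.w": [[1.0, 2.0], [3.0, 4.0]]}

delta = post_training_delta(instruct, base)
merged = graft(cpt_moe, expand_delta(delta, num_experts=2))
```

In this toy setup the first expert still equals the base weights, so after grafting it matches the instruct weights exactly, while the second expert, which drifted during CPT, keeps its offset and receives the same post-training shift.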