AI Summary
This work investigates how language specificity emerges in Mixture-of-Experts (MoE) models during multilingual continual pretraining, a mechanism that remains poorly understood. By analyzing routing dynamics in an English-centric MoE model trained on a multilingual corpus, the study finds that routing in the early and middle layers becomes diffuse and language-agnostic, while language-specific behavior arises mainly in the final layers. It also shows that token-level vocabulary overlap between languages influences expert routing patterns. Building on these findings, the authors propose a parameter-efficient adaptation strategy that updates only the language-specific and shared experts in the final MoE layers. Evaluated on the MultiBLiMP and Belebele benchmarks, this approach updates fewer than 2% of the model's parameters while remaining competitive with fine-tuning the complete final layers, giving a strong trade-off between efficiency and performance in multilingual settings.
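The adaptation strategy summarized above amounts to choosing a small subset of parameters to update. The following is a minimal sketch of that selection step, not the authors' implementation: the parameter naming scheme (`layers.<L>.experts.<E>.weight`, `layers.<L>.shared_expert.weight`), the toy parameter counts, and the chosen layer and expert indices are all hypothetical, and the resulting fraction is not meant to reproduce the paper's figure.

```python
def select_trainable(param_sizes, final_layers, expert_ids):
    """Return names of parameters in the chosen routed or shared experts of the given layers."""
    selected = set()
    for name in param_sizes:
        parts = name.split(".")
        # Hypothetical naming: layers.<L>.experts.<E>.<w> or layers.<L>.shared_expert.<w>
        if len(parts) < 3 or parts[0] != "layers" or int(parts[1]) not in final_layers:
            continue
        if parts[2] == "shared_expert":
            selected.add(name)
        elif parts[2] == "experts" and int(parts[3]) in expert_ids:
            selected.add(name)
    return selected


# Toy model (assumed sizes): 4 layers, each with attention, one shared expert, 8 routed experts.
param_sizes = {"embed.weight": 1000}
for layer in range(4):
    param_sizes[f"layers.{layer}.attn.weight"] = 200
    param_sizes[f"layers.{layer}.shared_expert.weight"] = 100
    for expert in range(8):
        param_sizes[f"layers.{layer}.experts.{expert}.weight"] = 100

# Assume experts 2 and 5 of the last layer were identified as language-specific.
selected = select_trainable(param_sizes, final_layers={3}, expert_ids={2, 5})
trainable = sum(param_sizes[name] for name in selected)
fraction = trainable / sum(param_sizes.values())
print(f"trainable parameters: {trainable} ({fraction:.1%} of the toy model)")
```

In a PyTorch training loop, the same selection would typically be applied by setting `requires_grad` to `False` on every parameter whose name is not in the selected set.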
Abstract
Mixture-of-Experts (MoE) models are widely used to scale language models, yet their expert routing behavior and adaptation in a multilingual setting remain underexplored. In this work, we study multilingual routing dynamics during continual pre-training of an English-centric MoE model on a multilingual corpus, analyzing how expert usage varies across languages. We find that continual multilingual pre-training leads to diffused, language-agnostic routing in early and middle layers, with language specialization primarily emerging in the final layers. We also show that token-level vocabulary overlap between languages plays an important role in how languages are routed. Motivated by these findings, we propose a parameter-efficient adaptation strategy that updates language-specific and shared experts in the final MoE layers. Experiments on MultiBLiMP and Belebele show that our method achieves a strong performance-efficiency trade-off, attaining competitive performance relative to fine-tuning complete final layers, while updating less than 2% of the parameters. Overall, our findings provide insights into where and how language specialization emerges in MoEs during continual pre-training and provide practical insights for low-resource multilingual adaptation. Our code is available at https://github.com/aditi184/moe-routing-adaptation.
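One way to make "language-agnostic" versus "language-specific" routing concrete is to compare per-language expert usage distributions layer by layer. The sketch below is an illustration under stated assumptions, not the analysis from the paper: it assumes top-1 routing, uses made-up routing logs for two languages, and uses Jensen-Shannon divergence as the comparison metric, which the abstract does not specify.

```python
import math
from collections import Counter


def expert_usage(expert_ids, num_experts):
    """Fraction of routed tokens assigned to each expert."""
    counts = Counter(expert_ids)
    total = len(expert_ids)
    return [counts.get(e, 0) / total for e in range(num_experts)]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical usage, 1 for disjoint usage."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


NUM_EXPERTS = 4

# Hypothetical routing logs: routing[layer][language] = expert id chosen for each token.
routing = {
    "early": {"en": [0, 1, 2, 3, 0, 1, 2, 3], "hi": [0, 1, 2, 3, 1, 0, 3, 2]},
    "final": {"en": [0, 0, 0, 1, 0, 0, 1, 0], "hi": [3, 3, 2, 3, 3, 2, 3, 3]},
}

divergence_by_layer = {
    layer: js_divergence(
        expert_usage(langs["en"], NUM_EXPERTS),
        expert_usage(langs["hi"], NUM_EXPERTS),
    )
    for layer, langs in routing.items()
}

for layer, value in divergence_by_layer.items():
    print(f"{layer}: JS divergence between en and hi = {value:.3f}")
```

In this toy data both languages spread evenly over all experts in the early layer, so the divergence is low, while in the final layer they use disjoint experts, so the divergence is high. With real routing logs, a profile that stays low through the early and middle layers and rises at the end would match the pattern the abstract describes.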