Less, but Better: Efficient Multilingual Expansion for LLMs via Layer-wise Mixture-of-Experts

📅 2025-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of parameter explosion, catastrophic forgetting of previously supported languages, and low expansion efficiency in the multilingual continual scaling of large language models (LLMs), this paper proposes LayerMoE, a layer-adaptive sparse Mixture-of-Experts (MoE) architecture. Its core innovations are: (1) dynamic per-layer expert assignment based on inter-layer linguistic representation similarity, and (2) a lightweight pre-classifier that guides token routing for legacy languages, enabling expert reuse and stable knowledge retention. LayerMoE achieves single-stage expansion with 60% fewer experts and reduces the number of added experts by 33.3% in lifelong multilingual scaling. It significantly outperforms state-of-the-art methods on retained performance for old languages, markedly alleviating forgetting. The approach strikes a strong balance among parameter efficiency, stability against forgetting, and scalability, enabling efficient, robust, and sustainable multilingual LLM growth.

📝 Abstract
Continually expanding existing large language models (LLMs) to new languages is a promising yet challenging approach to building powerful multilingual LLMs. The biggest challenge is to make the model continuously learn new languages while preserving its proficiency in old languages. To achieve this, recent work utilizes the Mixture-of-Experts (MoE) architecture to expand to new languages by adding new experts, and avoids catastrophic forgetting of old languages by routing the corresponding tokens to the original model backbone (old experts). Although intuitive, this kind of method is parameter-costly when expanding to new languages and still inevitably impacts the performance of old languages. To address these limitations, we analyze the language characteristics of different layers in LLMs and propose a layer-wise expert allocation algorithm (LayerMoE) to determine the appropriate number of new experts for each layer. Specifically, we find that different layers in LLMs exhibit different representation similarities between languages, and we use this similarity as the indicator for allocating experts to each layer: the higher the similarity, the fewer the experts. Additionally, to further mitigate forgetting of old languages, we add a classifier in front of the router network on the layers with higher similarity to guide the routing of old-language tokens. Experimental results show that our method outperforms the previous state-of-the-art baseline with 60% fewer experts in the single-expansion setting and with 33.3% fewer experts in the lifelong-expansion setting, demonstrating its effectiveness.
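The allocation rule described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function name `allocate_experts`, the mean-pooled cosine similarity, and the `1 - similarity` weighting are all assumptions about one plausible way to turn "higher similarity, fewer experts" into concrete per-layer counts.

```python
import numpy as np

def allocate_experts(hidden_old, hidden_new, total_new_experts):
    """Allocate new experts per layer inversely to cross-lingual
    representation similarity (illustrative sketch; the paper's exact
    allocation rule may differ).

    hidden_old, hidden_new: lists of (tokens, dim) arrays, one per layer,
    holding hidden states for old- and new-language text.
    """
    sims = []
    for h_old, h_new in zip(hidden_old, hidden_new):
        # Compare mean-pooled layer representations via cosine similarity.
        a, b = h_old.mean(axis=0), h_new.mean(axis=0)
        sims.append(float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)))
    sims = np.array(sims)
    # Higher similarity -> fewer experts: weight each layer by (1 - sim).
    weights = np.clip(1.0 - sims, 0.0, None)
    weights = weights / weights.sum()
    counts = np.floor(weights * total_new_experts).astype(int)
    # Distribute any leftover experts to the least-similar layers.
    for i in np.argsort(sims)[: total_new_experts - counts.sum()]:
        counts[i] += 1
    return counts
```

The key property is that the layer budget sums to a fixed total, so lowering the count on high-similarity layers frees capacity for dissimilar layers instead of shrinking the model's expansion budget.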
Problem

Research questions and friction points this paper is trying to address.

Efficiently expanding multilingual LLMs without forgetting old languages
Optimizing expert allocation per layer based on language similarity
Reducing parameter costs while maintaining old language performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Layer-wise expert allocation algorithm
Classifier-guided routing for old languages
Fewer experts via language similarity
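The classifier-guided routing idea above can be sketched as follows. The function `guided_route`, the linear classifier, and the logit-masking scheme are hypothetical simplifications of the paper's pre-classifier, shown only to illustrate how tokens identified as old-language can be constrained to the original experts.

```python
import numpy as np

def guided_route(x, W_router, W_cls, n_old):
    """Illustrative sketch of classifier-guided MoE routing (not the
    paper's code). A lightweight classifier decides whether each token
    belongs to an old language; if so, the router may only select among
    the original (old) experts, protecting them from interference.

    x: (tokens, dim); W_router: (dim, n_experts); W_cls: (dim, 2).
    Returns the chosen expert index per token.
    """
    logits = x @ W_router                  # router scores for all experts
    is_old = (x @ W_cls).argmax(-1) == 0   # class 0 = old language
    masked = logits.copy()
    masked[:, n_old:] = -np.inf            # hide new experts from old tokens
    return np.where(is_old, masked.argmax(-1), logits.argmax(-1))
```

With the mask in place, old-language tokens always land on old experts regardless of the router's raw scores, which is the mechanism the summary credits with stable knowledge retention.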
Xue Zhang
Key Laboratory of Big Data & Artificial Intelligence in Transportation, (Beijing Jiaotong University), Ministry of Education; School of Computer Science and Technology, Beijing Jiaotong University, Beijing, China
Yunlong Liang
WeChat
Natural Language Processing (NLP)
Fandong Meng
WeChat AI, Tencent
Machine Translation; Natural Language Processing
Songming Zhang
Beijing Jiaotong University
natural language processing; text generation; machine translation
Yufeng Chen
Key Laboratory of Big Data & Artificial Intelligence in Transportation, (Beijing Jiaotong University), Ministry of Education; School of Computer Science and Technology, Beijing Jiaotong University, Beijing, China
Jinan Xu
Professor of School of Computer and Information Technology, Beijing Jiaotong University
NLP; Machine Translation; LLM
Jie Zhou
Pattern Recognition Center, WeChat AI, Tencent Inc, China