🤖 AI Summary
This work addresses parameter interference, a problem that arises when large language models are fine-tuned on parallel corpora for multilingual machine translation. To mitigate it, the authors propose Mix-MoE, a two-stage mixture-of-experts (MoE) post-pretraining framework. In the first stage, monolingual corpora are used to train language modeling experts that preserve monolingual knowledge; in the second stage, parallel corpora are used to train machine translation experts that acquire translation knowledge. The approach also introduces a routing mechanism enhanced with Fourier-transform features derived from model representations, intended to help the two expert groups interact and to exploit structural patterns in text. According to the reported experiments, Mix-MoE significantly outperforms existing baselines on multilingual translation and makes notable progress in mitigating parameter interference.
📝 Abstract
Large Language Models (LLMs) have shown great promise in multilingual machine translation (MT), even with limited bilingual supervision. However, fine-tuning LLMs with parallel corpora presents a major challenge: parameter interference. To address this issue, we propose Mix-MoE, a mixed Mixture-of-Experts framework designed to train LLMs for multilingual MT. Our framework operates in two distinct stages: (1) post-pretraining with MoE on monolingual corpora, and (2) post-pretraining with MoE on parallel corpora. Crucially, we divide the experts in the MoE layers into two specialized groups: Language Model Experts (LM Experts) and Machine Translation Experts (MT Experts). LM Experts are designed to capture and retain the monolingual knowledge learned by the pre-trained LLM. MT Experts, on the other hand, are specifically trained to acquire and store bilingual translation knowledge. Furthermore, to facilitate effective interaction between these specialized experts and to leverage potential underlying structural patterns in text, we introduce a routing mechanism enhanced by Fourier Transform features derived from model representations. The experimental results demonstrate that Mix-MoE excels in multilingual MT, significantly outperforming existing baselines and showing notable progress in mitigating parameter interference.
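The abstract does not specify how the Fourier Transform features enter the router, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that the features are the magnitudes of a real FFT taken along the hidden dimension, that they are concatenated with the token representation before a linear router, and that a single top-k router selects among LM Experts and MT Experts together. The names `FourierRouterMoE`, `fourier_features`, `n_lm_experts`, `n_mt_experts`, and `top_k` are all hypothetical.

```python
import numpy as np


def fourier_features(h):
    """Magnitude spectrum of each token vector (assumed feature choice).

    h has shape (tokens, d_model); the result has shape (tokens, d_model // 2 + 1).
    """
    return np.abs(np.fft.rfft(h, axis=-1))


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class FourierRouterMoE:
    """Toy MoE layer: experts 0..n_lm-1 play the LM role, the rest the MT role."""

    def __init__(self, d_model, n_lm_experts, n_mt_experts, top_k=2, seed=0):
        rng = np.random.default_rng(seed)
        self.n_lm = n_lm_experts
        self.n_experts = n_lm_experts + n_mt_experts
        self.top_k = top_k
        d_freq = d_model // 2 + 1
        # Router sees the hidden state plus its Fourier magnitudes (assumption).
        self.w_router = rng.normal(0.0, 0.02, size=(d_model + d_freq, self.n_experts))
        # Each expert is a single linear map here, purely for illustration.
        self.experts = [
            rng.normal(0.0, 0.02, size=(d_model, d_model))
            for _ in range(self.n_experts)
        ]

    def route(self, h):
        feats = np.concatenate([h, fourier_features(h)], axis=-1)
        logits = feats @ self.w_router
        # Keep the top_k highest-scoring experts per token.
        top_idx = np.argsort(logits, axis=-1)[:, -self.top_k:]
        top_logits = np.take_along_axis(logits, top_idx, axis=-1)
        gates = softmax(top_logits, axis=-1)  # renormalize over the selected experts
        return top_idx, gates

    def forward(self, h):
        top_idx, gates = self.route(h)
        out = np.zeros_like(h)
        for t in range(h.shape[0]):
            for j in range(self.top_k):
                expert = self.experts[top_idx[t, j]]
                out[t] += gates[t, j] * (h[t] @ expert)
        return out


if __name__ == "__main__":
    moe = FourierRouterMoE(d_model=8, n_lm_experts=2, n_mt_experts=2, top_k=2)
    h = np.random.default_rng(1).normal(size=(5, 8))
    idx, gates = moe.route(h)
    print("selected experts per token:\n", idx)
    print("output shape:", moe.forward(h).shape)
```

In this sketch a selected index below `n_lm` refers to an LM Expert and any other index to an MT Expert, so one token can mix both kinds of knowledge. How the two-stage training freezes or updates each group is not described in the abstract and is left out here.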