Continual Pre-training of MoEs: How robust is your router?

📅 2025-03-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study investigates routing stability and catastrophic forgetting in sparse Mixture-of-Experts (MoE) large language models during continual pre-training (CPT): specifically, whether routing algorithms exacerbate forgetting, whether routers maintain a balanced load on previous distributions, and whether CPT strategies developed for dense models transfer to MoEs. Using Switch and DeepSeek MoE architectures (>2B parameters, trained for 600B tokens), the authors systematically evaluate Sinkhorn-balanced and Z-/Aux-loss-balanced routing together with replay and learning-rate re-warming. Both routing algorithms prove surprisingly robust to distribution shift, even under replay-free CPT. MoE models also retain their sample-efficiency advantage over FLOP-matched dense models throughout CPT, and a continually pre-trained MoE matches the performance of a fully re-trained one at a fraction of the compute cost.
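The Z- and Aux-loss balancing objectives the study evaluates are the standard ones from the Switch Transformer and ST-MoE lines of work: an auxiliary loss that pushes the router toward a uniform token-to-expert load, and a z-loss that penalizes large router logits. A minimal NumPy sketch (function and variable names are illustrative, not the paper's code):

```python
import numpy as np

def router_losses(logits, top1_idx, num_experts):
    """Load-balancing losses for a softmax top-1 router (illustrative sketch).

    logits:   [tokens, num_experts] router scores.
    top1_idx: [tokens] index of the expert chosen for each token.
    """
    # Softmax over experts, computed stably.
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)

    # Switch-style aux loss: fraction of tokens routed to each expert (f_i)
    # dotted with the mean router probability for that expert (P_i).
    # Minimized when both are uniform, i.e. load is balanced.
    f = np.bincount(top1_idx, minlength=num_experts) / len(top1_idx)
    P = probs.mean(axis=0)
    aux_loss = num_experts * np.sum(f * P)

    # Z-loss: squared log-partition-function of the router logits,
    # discouraging the logits from drifting to large magnitudes.
    z = np.log(np.sum(np.exp(logits), axis=-1))
    z_loss = np.mean(z ** 2)
    return aux_loss, z_loss
```

With perfectly uniform logits and a balanced assignment, the aux loss evaluates to 1 (its minimum) and the z-loss to `log(num_experts)**2`, which is why both terms are added to the language-modeling loss with small coefficients in practice.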

📝 Abstract
Sparsely-activated Mixture of Experts (MoE) transformers are promising architectures for foundation models. Compared to dense transformers that require the same amount of floating point operations (FLOPs) per forward pass, MoEs benefit from improved sample efficiency at training time and achieve much stronger performance. Many closed-source and open-source frontier language models have thus adopted an MoE architecture. Naturally, practitioners will want to extend the capabilities of these models with large amounts of newly collected data without completely re-training them. Prior work has shown that a simple combination of replay and learning rate re-warming and re-decaying can enable the continual pre-training (CPT) of dense decoder-only transformers with minimal performance degradation compared to full re-training. In the case of decoder-only MoE transformers, however, it is unclear how the routing algorithm will impact continual pre-training performance: 1) do the MoE transformer's routers exacerbate forgetting relative to a dense model?; 2) do the routers maintain a balanced load on previous distributions after CPT?; 3) are the same strategies applied to dense models sufficient to continually pre-train MoE LLMs? In what follows, we conduct a large-scale (>2B parameter Switch and DeepSeek MoE LLMs trained for 600B tokens) empirical study across four MoE transformers to answer these questions. Our results establish a surprising robustness to distribution shifts for both Sinkhorn-Balanced and Z-and-Aux-loss-balanced routing algorithms, even in MoEs continually pre-trained without replay. Moreover, we show that MoE LLMs maintain their sample efficiency (relative to a FLOP-matched dense model) during CPT and that they can match the performance of a fully re-trained MoE at a fraction of the cost.
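The "learning rate re-warming and re-decaying" recipe the abstract borrows from dense-model CPT is typically a short linear warmup back toward the peak learning rate, followed by a cosine decay to a minimum over the new data. A hedged sketch, with all step counts and rates illustrative:

```python
import math

def cpt_lr(step, total_steps, warmup_steps, lr_max, lr_min):
    """Re-warmed/re-decayed LR schedule for continual pre-training (sketch).

    Linearly re-warm from lr_min back up to lr_max over warmup_steps,
    then cosine-decay back down to lr_min over the remaining CPT steps.
    """
    if step < warmup_steps:
        # Linear re-warming phase.
        return lr_min + (lr_max - lr_min) * step / warmup_steps
    # Cosine re-decay phase.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```

Combined with replaying a small fraction of the original pre-training mix, this schedule is what prior dense-model work found sufficient to approach full re-training; the paper's question is whether the same recipe suffices once a router is in the loop.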
Problem

Research questions and friction points this paper is trying to address.

Impact of routing algorithms on continual pre-training of MoE transformers.
Robustness of MoE transformers to distribution shifts during continual pre-training.
Sample efficiency and cost-effectiveness of MoE transformers in continual pre-training.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale empirical study of continual pre-training across four >2B-parameter MoE LLMs (Switch and DeepSeek) trained for 600B tokens
Evidence that Sinkhorn-balanced and Z-and-Aux-loss-balanced routers remain robust and load-balanced under distribution shift, even without replay
Demonstration that replay plus learning-rate re-warming lets a continually pre-trained MoE match full re-training at a fraction of the cost
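Sinkhorn-balanced routing, one of the two router families the paper stress-tests, can be sketched generically as alternating normalization of the token-by-expert affinity matrix so that every token distributes one unit of mass while every expert receives a near-equal share. This is a textbook Sinkhorn-Knopp sketch, not the paper's implementation:

```python
import numpy as np

def sinkhorn_balance(scores, n_iters=50):
    """Sinkhorn normalization of router scores (generic sketch).

    scores: [tokens, num_experts] router logits.
    Returns a nonnegative assignment matrix whose rows each sum to 1
    (one unit of routing mass per token) and whose columns approach
    tokens/num_experts (balanced expert load).
    """
    pi = np.exp(scores - scores.max())  # positive affinities, stable exp
    for _ in range(n_iters):
        pi /= pi.sum(axis=0, keepdims=True)  # equalize load per expert
        pi /= pi.sum(axis=1, keepdims=True)  # one unit of mass per token
    return pi
```

Taking `pi.argmax(axis=1)` then yields a top-1 assignment that is balanced by construction rather than by an auxiliary loss term, which is why Sinkhorn routers need no extra balancing objective in the training loss.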