Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization

📅 2025-02-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the training stagnation and slow long-term convergence observed when "upcycling" pretrained dense models into Mixture-of-Experts (MoE) architectures, this paper proposes Drop-Upcycling. The method leverages pretrained knowledge while statistically re-initializing part of the expert weights, and jointly optimizes the routing mechanism. By combining knowledge inheritance with selective weight re-initialization, Drop-Upcycling promotes expert specialization and addresses long-standing efficiency bottlenecks in upcycling-based MoE construction. An MoE model with 5.9B active parameters trained this way matches a 13B dense baseline from the same model family while requiring roughly 1/4 of the training FLOPs (a ~75% reduction), and it consistently outperforms existing MoE construction methods at training scales of hundreds of billions of tokens.

📝 Abstract
The Mixture of Experts (MoE) architecture significantly reduces training and inference cost compared to a dense model of equivalent capacity. Upcycling is an approach that initializes and trains an MoE model using a pre-trained dense model. While upcycling leads to initial performance gains, training progresses more slowly than training from scratch, leading to suboptimal performance in the long term. We propose Drop-Upcycling, a method that effectively addresses this problem. Drop-Upcycling combines two seemingly contradictory approaches: utilizing the knowledge of pre-trained dense models while statistically re-initializing some parts of the weights. This approach strategically promotes expert specialization, significantly enhancing the MoE model's efficiency in knowledge acquisition. Extensive large-scale experiments demonstrate that Drop-Upcycling significantly outperforms previous MoE construction methods in the long term, specifically when training on hundreds of billions of tokens or more. As a result, our MoE model with 5.9B active parameters achieves comparable performance to a 13B dense model in the same model family, while requiring approximately 1/4 of the training FLOPs. All experimental resources, including source code, training data, model checkpoints, and logs, are publicly available to promote reproducibility and future research on MoE.
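The core idea described above (copy the dense FFN into each expert, then statistically re-initialize a fraction of each expert's weights so the experts can diverge and specialize) can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the function name, the row-wise dropping, and the choice of a normal distribution matched to the original weights' mean and standard deviation are assumptions made for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def drop_upcycle_expert(w_dense, reinit_ratio=0.5, rng=rng):
    """Illustrative partial re-initialization of one expert's FFN matrix.

    w_dense: (d_ff, d_model) weight copied from the pre-trained dense FFN.
    A fraction `reinit_ratio` of the intermediate (d_ff) rows is replaced
    with samples from a normal distribution whose mean and std match the
    original weights, so re-initialized rows stay on the same scale as
    the inherited ones while breaking the symmetry between experts.
    """
    w = w_dense.copy()
    d_ff = w.shape[0]
    n_drop = int(reinit_ratio * d_ff)
    # Each expert drops a different random subset of rows.
    drop_idx = rng.choice(d_ff, size=n_drop, replace=False)
    mu, sigma = w_dense.mean(), w_dense.std()
    w[drop_idx] = rng.normal(mu, sigma, size=(n_drop, w.shape[1]))
    return w

# Build an 8-expert MoE FFN layer from a single dense FFN matrix.
w_dense = rng.normal(0.0, 0.02, size=(1024, 512))
experts = [drop_upcycle_expert(w_dense, reinit_ratio=0.5) for _ in range(8)]
```

Because each expert keeps a different half of the dense weights and re-draws the rest, the experts start close to the pre-trained solution yet are no longer identical, which is what allows the router to learn meaningful specialization.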
Problem

Research questions and friction points this paper is trying to address.

Slow long-term convergence when upcycling dense models into MoE
Leveraging pre-trained dense model knowledge without hindering expert specialization
Reducing the training FLOPs needed to reach dense-model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines pre-trained dense weights with partial statistical re-initialization
Promotes expert specialization, improving knowledge-acquisition efficiency
Matches a 13B dense model with roughly 1/4 of the training FLOPs