🤖 AI Summary
To address the persistent training stagnation and slow convergence observed when “upcycling” pretrained dense models into Mixture-of-Experts (MoE) architectures, this paper proposes Drop-Upcycling. The method leverages pretrained knowledge while statistically re-initializing parts of the expert weights, and trains the routing mechanism jointly. By combining knowledge inheritance with selective weight re-initialization, Drop-Upcycling promotes expert specialization and mitigates the efficiency bottleneck of upcycling-based MoE construction. An MoE model with 5.9B active parameters trained with this method matches the performance of a 13B dense baseline while requiring roughly a quarter of its training FLOPs, and at training scales of hundreds of billions of tokens or more it consistently outperforms existing MoE construction methods.
📝 Abstract
The Mixture of Experts (MoE) architecture significantly reduces training and inference costs compared to a dense model of equivalent capacity. Upcycling is an approach that initializes and trains an MoE model using a pre-trained dense model. While upcycling leads to initial performance gains, training progresses more slowly than training from scratch, leading to suboptimal performance in the long term. We propose Drop-Upcycling, a method that effectively addresses this problem. Drop-Upcycling combines two seemingly contradictory approaches: utilizing the knowledge of pre-trained dense models while statistically re-initializing some parts of the weights. This approach strategically promotes expert specialization, significantly enhancing the MoE model's efficiency in knowledge acquisition. Extensive large-scale experiments demonstrate that Drop-Upcycling significantly outperforms previous MoE construction methods in the long term, specifically when training on hundreds of billions of tokens or more. As a result, our MoE model with 5.9B active parameters achieves performance comparable to a 13B dense model in the same model family, while requiring approximately 1/4 of the training FLOPs. All experimental resources, including source code, training data, model checkpoints, and logs, are publicly available to promote reproducibility and future research on MoE.
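To make the mechanism concrete, below is a minimal PyTorch-style sketch of the core idea as described above: each expert's FFN is copied from the dense model, and a randomly chosen fraction of the intermediate dimension is re-sampled from a normal distribution matched to the original weight statistics. The function name `drop_upcycle_ffn`, the ratio `r`, and the choice of matching per-matrix mean/std are illustrative assumptions, not details taken from the paper's released code.

```python
import torch


def drop_upcycle_ffn(w_up: torch.Tensor, w_down: torch.Tensor,
                     num_experts: int, r: float = 0.5, seed: int = 0):
    """Build expert FFN weights from a dense FFN by partial re-initialization.

    w_up:   (d_ff, d_model) up projection of the dense FFN
    w_down: (d_model, d_ff) down projection of the dense FFN
    r:      fraction of the intermediate (d_ff) units re-initialized per expert
    """
    gen = torch.Generator().manual_seed(seed)
    d_ff = w_up.shape[0]
    n_drop = int(r * d_ff)
    experts = []
    for _ in range(num_experts):
        # Each expert starts as an exact copy of the dense FFN (knowledge inheritance).
        up, down = w_up.clone(), w_down.clone()
        # Pick a different random subset of intermediate units for each expert.
        idx = torch.randperm(d_ff, generator=gen)[:n_drop]
        # Re-sample the selected rows/columns from a normal distribution whose
        # mean and std match the corresponding original weight matrix.
        up[idx, :] = w_up.mean() + w_up.std() * torch.randn(
            n_drop, w_up.shape[1], generator=gen, dtype=w_up.dtype)
        down[:, idx] = w_down.mean() + w_down.std() * torch.randn(
            w_down.shape[0], n_drop, generator=gen, dtype=w_down.dtype)
        experts.append({"w_up": up, "w_down": down})
    return experts
```

Giving each expert a different re-initialized subset breaks the initial symmetry between otherwise identical expert copies, which is what the abstract credits for promoting expert specialization while still retaining most of the pre-trained dense model's knowledge.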