Upcycling Large Language Models into Mixture of Experts

📅 2024-10-10
🏛️ arXiv.org
📈 Citations: 12
Influential: 1
📄 PDF
🤖 AI Summary
To address the challenge of efficiently scaling the capacity of pretrained dense large language models (LLMs), this paper studies sparse Mixture-of-Experts (MoE) upcycling at billion-parameter scale. Methodologically, it introduces: (1) a novel virtual-group initialization scheme coupled with weight scaling to enable fine-grained expert partitioning and stable training; (2) a softmax-then-top-K routing mechanism, which outperforms the conventional top-K-then-softmax ordering; and (3) a large-scale upcycling training recipe that avoids training from scratch. Upcycling Nemotron-4 15B on 1 trillion tokens, the resulting MoE model achieves 67.6% on MMLU, surpassing the continued-dense-training baseline (65.3%) trained on the same data, while maintaining high inference efficiency. The method thus delivers gains in both accuracy and computational efficiency, demonstrating a practical path to cost-effective LLM capacity expansion.
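The routing difference described above is small but concrete: both variants select K experts per token, but they apply the softmax normalization at different points. A minimal sketch with numpy (the function names and the toy logits are illustrative, not from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def topk_then_softmax(logits, k):
    # Conventional ordering: select the k largest router logits first,
    # then normalize over only the selected experts (gates sum to 1).
    idx = np.argsort(logits)[::-1][:k]
    return idx, softmax(logits[idx])

def softmax_then_topk(logits, k):
    # Ordering favored in the paper: softmax over ALL experts first,
    # then keep the top-k probabilities without renormalizing, so the
    # gate magnitudes still reflect probability mass on unselected experts.
    probs = softmax(logits)
    idx = np.argsort(probs)[::-1][:k]
    return idx, probs[idx]

logits = np.array([2.0, 1.0, 0.5, -1.0])
idx_a, gates_a = topk_then_softmax(logits, k=2)   # gates sum to exactly 1
idx_b, gates_b = softmax_then_topk(logits, k=2)   # gates sum to less than 1
```

Both orderings pick the same experts here; they differ only in the gate values the selected experts receive.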

📝 Abstract
Upcycling pre-trained dense language models into sparse mixture-of-experts (MoE) models is an efficient approach to increase the model capacity of already trained models. However, optimal techniques for upcycling at scale remain unclear. In this work, we conduct an extensive study of upcycling methods and hyperparameters for billion-parameter scale language models. We propose a novel "virtual group" initialization scheme and weight scaling approach to enable upcycling into fine-grained MoE architectures. Through ablations, we find that upcycling outperforms continued dense model training. In addition, we show that softmax-then-topK expert routing improves over the topK-then-softmax approach, and that higher-granularity MoEs can help improve accuracy. Finally, we upcycled Nemotron-4 15B on 1T tokens and compared it to a continuously trained version of the same model on the same 1T tokens: the continuously trained model achieved 65.3% MMLU, whereas the upcycled model achieved 67.6%. Our results offer insights and best practices to effectively leverage upcycling for building MoE language models.
Problem

Research questions and friction points this paper is trying to address.

Optimizing techniques for upcycling dense models into MoE architectures
Improving expert routing and granularity in sparse MoE models
Enhancing model capacity and accuracy via upcycling pre-trained LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Virtual group initialization for MoE
Weight scaling in fine-grained MoE
Softmax-then-topK expert routing
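The core idea behind upcycling is that each expert can start as a copy of the pretrained dense FFN, so the MoE layer reproduces the dense layer's function at initialization and training continues from there. The sketch below shows this baseline property with numpy; it is a simplified illustration and does not reproduce the paper's virtual-group sharding or its specific weight-scaling factor (all dimensions and weight values here are arbitrary toy choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts = 8, 32, 4

# Pretrained dense FFN weights to be upcycled (toy random stand-ins).
w1 = rng.standard_normal((d_model, d_ff)) * 0.1
w2 = rng.standard_normal((d_ff, d_model)) * 0.1

def dense_ffn(x):
    # Standard two-layer ReLU FFN block.
    return np.maximum(x @ w1, 0) @ w2

# Upcycling baseline: every expert begins as an exact copy of the dense FFN.
experts = [(w1.copy(), w2.copy()) for _ in range(n_experts)]

def moe_ffn(x, gates, idx):
    # Gate-weighted sum over the selected experts' outputs.
    out = np.zeros_like(x)
    for g, i in zip(gates, idx):
        e1, e2 = experts[i]
        out += g * (np.maximum(x @ e1, 0) @ e2)
    return out

x = rng.standard_normal(d_model)
# With identical experts and gates summing to 1, the upcycled MoE layer
# matches the dense layer exactly at initialization.
y_moe = moe_ffn(x, gates=np.array([0.5, 0.5]), idx=[0, 2])
y_dense = dense_ffn(x)
```

When the gates instead come from softmax-then-topK routing they sum to less than 1, which is the mismatch the paper's weight-scaling approach is designed to compensate for.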