Scaling Laws for Upcycling Mixture-of-Experts Language Models

📅 2025-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the efficiency of constructing Mixture-of-Experts (MoE) language models via upcycling, i.e., reusing pretrained dense models, under large compute budgets. Through empirical scaling experiments and compute-budget modeling, the authors establish a quantitative scaling law for MoE upcycling. Their analysis reveals a nonlinear interaction between the dense-training and upcycling dataset sizes that limits training efficiency at scale. They further derive conditions under which upcycling outperforms from-scratch training within a budget constraint, and provide actionable scaling guidelines linking expert count, data volume, and model performance. Results show that, under moderate compute budgets, upcycled MoE models achieve better cost-performance trade-offs than models trained from scratch.

📝 Abstract
Pretraining large language models (LLMs) is resource-intensive, often requiring months of training time even with high-end GPU clusters. There are two approaches to mitigating such computational demands: reusing smaller models to train larger ones (upcycling), and training computationally efficient models such as mixture-of-experts (MoE). In this paper, we study the upcycling of LLMs to MoE models, whose scaling behavior remains underexplored. Through extensive experiments, we identify empirical scaling laws that describe how performance depends on dataset size and model configuration. In particular, we show that, while scaling these factors improves performance, there is a novel interaction term between the dense and upcycled training datasets that limits the efficiency of upcycling at large computational budgets. Based on these findings, we provide guidance on scaling upcycling, and establish conditions under which upcycling outperforms from-scratch training within budget constraints.
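The abstract describes identifying empirical scaling laws from training runs. A minimal sketch of that general methodology, using entirely synthetic data and an assumed Chinchilla-style power-law form (the functional form and all parameter values here are illustrative assumptions, not the paper's fitted law), might look like:

```python
import numpy as np
from scipy.optimize import curve_fit

def loss_law(D, E, A, alpha):
    """Assumed power-law form: L(D) = E + A * D**(-alpha)."""
    return E + A * np.power(D, -alpha)

# Synthetic "training runs": upcycling-dataset sizes (tokens) and final
# losses, generated from the assumed law with small multiplicative noise.
rng = np.random.default_rng(0)
D = np.logspace(8, 11, 12)            # 100M .. 100B tokens
true_params = (1.7, 400.0, 0.28)      # hypothetical E, A, alpha
L = loss_law(D, *true_params) * (1 + 0.005 * rng.standard_normal(D.size))

# Fit the assumed law to the observed (D, L) pairs.
(E, A, alpha), _ = curve_fit(loss_law, D, L, p0=(1.0, 100.0, 0.3), maxfev=10000)
print(f"fitted E={E:.2f}, A={A:.1f}, alpha={alpha:.3f}")
```

In practice, the fitted exponents and irreducible-loss term are what the paper's scaling guidelines would be built from; the extra interaction term between dense and upcycled data that the abstract highlights would enter as an additional term in the fitted law.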
Problem

Research questions and friction points this paper is trying to address.

Scaling laws for upcycling Mixture-of-Experts models
Interaction between dense and upcycled training datasets
Conditions for upcycling outperforming from-scratch training
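The third question above, when upcycling beats from-scratch training under a budget, can be illustrated with two hypothetical loss curves: the upcycled model starts ahead (it inherits the dense checkpoint) but improves with a smaller data exponent, so from-scratch training overtakes it at some crossover budget. All parameter values below are made up for illustration and are not the paper's fitted coefficients:

```python
import numpy as np

# Hypothetical loss laws (illustrative parameters, not the paper's fit):
# the upcycled model has a lower prefactor (head start from the dense
# checkpoint) but a smaller exponent, so it improves more slowly.
def loss_upcycled(D):
    return 1.7 + 52.0 * D ** -0.22

def loss_scratch(D):
    return 1.7 + 400.0 * D ** -0.30

budgets = np.logspace(8, 13, 200)            # training tokens
gap = loss_upcycled(budgets) - loss_scratch(budgets)
crossover = budgets[np.argmax(gap > 0)]      # first budget where scratch wins
print(f"from-scratch overtakes upcycling at ~{crossover:.2e} tokens")
```

With these toy numbers, upcycling wins at moderate budgets (e.g. around 10^9 tokens) and from-scratch training wins past roughly 10^11 tokens, mirroring the qualitative conclusion in the abstract.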
Innovation

Methods, ideas, or system contributions that make the work stand out.

Upcycling smaller to larger models
Mixture-of-Experts model efficiency
Empirical scaling laws identified