🤖 AI Summary
This work addresses the inefficiency in training sparse Mixture-of-Experts (MoE) models, where introducing too many experts early in training incurs substantial memory and communication overhead. To mitigate this, the authors propose a progressive training framework that treats MoE capacity as expandable memory, growing the number of experts over the course of training. The approach uses a sparsity-aware scaling law to derive compute-optimal token budgets for each training stage. By combining progressive expert expansion with this scaling law, the method starts training with a small expert pool and adds capacity later. Large-scale experiments show that the framework matches the performance of a fixed-expert configuration while improving wall-clock efficiency, reducing both GPU cost and overall training time.
📝 Abstract
Sparse Mixture-of-Experts (MoE) models offer a powerful way to scale model size without increasing compute, as per-token FLOPs depend only on the k active experts rather than the total pool of E experts. Yet, this asymmetry creates an MoE efficiency paradox in practice: adding more experts balloons memory and communication costs, making actual training inefficient. We argue that this bottleneck arises in part because current MoE training allocates too many experts from the beginning, even though early-stage data may not fully utilize such capacity. Motivated by this, we propose EMO, a simple progressive training framework that treats MoE capacity as expandable memory and grows the expert pool over the course of training. EMO explicitly models sparsity in its scaling law to derive stage-wise compute-optimal token budgets for progressive expansion. Empirical results show that EMO matches the performance of a fixed-expert setup in large-scale experiments while improving wall-clock efficiency. It offers a surprisingly simple yet effective path to scalable MoE training, preserving the benefits of large expert pools while reducing both training time and GPU cost.
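The asymmetry between k and E described in the abstract can be illustrated with a rough back-of-the-envelope sketch. Everything here is an assumption made for illustration and is not taken from the paper: the `MoELayer` class, the layer sizes, the two-matrix expert FFN, and the doubling schedule for E. Router cost and biases are ignored, and this is not EMO's scaling law or its stage-wise token budgets.

```python
from dataclasses import dataclass


@dataclass
class MoELayer:
    """Hypothetical single MoE feed-forward layer (illustrative only)."""
    d_model: int      # hidden size
    d_ff: int         # expert FFN inner size
    num_experts: int  # E: total experts held in memory
    top_k: int        # k: experts activated per token

    def expert_params(self) -> int:
        # Assumes each expert is a two-matrix FFN (up and down projection),
        # with biases ignored.
        return 2 * self.d_model * self.d_ff

    def total_expert_params(self) -> int:
        # Parameter count (and hence memory) grows linearly with E.
        return self.num_experts * self.expert_params()

    def flops_per_token(self) -> int:
        # Forward pass only, ~2 FLOPs per multiply-accumulate, router ignored.
        # Depends on k, not on E.
        return 2 * self.top_k * self.expert_params()


# Hypothetical progressive schedule: E doubles between stages, k stays fixed.
for num_experts in (8, 16, 32, 64):
    layer = MoELayer(d_model=1024, d_ff=4096, num_experts=num_experts, top_k=2)
    print(
        f"E={num_experts:3d}  "
        f"expert params={layer.total_expert_params():,}  "
        f"FLOPs/token={layer.flops_per_token():,}"
    )
```

Under these assumptions, the FLOPs-per-token column stays constant across stages while the parameter column grows with E, which is the memory side of the paradox the abstract describes. Communication cost is not modeled here.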