Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models

📅 2025-01-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Micro-batch-level load-balancing loss (LBL) in Mixture-of-Experts (MoE) training hinders expert specialization by enforcing uniform expert activation within every micro-batch, thereby impeding domain-specific learning. Method: This paper proposes a global-batch LBL mechanism that synchronizes expert selection frequencies across the micro-batches of a global batch during distributed training, enabling corpus-level load balancing without architectural modification. It adds a lightweight cross-micro-batch communication step to the routing computation, mitigating the specialization suppression induced by micro-batch LBL. Contribution/Results: The paper provides the first empirical evidence that micro-batch-level LBL fundamentally constrains expert specialization. Evaluated on a 42.8B-parameter MoE model trained on 400B tokens, the approach significantly reduces perplexity and consistently improves downstream performance, especially on domain-specific tasks such as programming, demonstrating both effectiveness and scalability for billion-scale MoE models.

📝 Abstract
This paper revisits the implementation of $\textbf{L}$oad-$\textbf{b}$alancing $\textbf{L}$oss (LBL) when training Mixture-of-Experts (MoEs) models. Specifically, LBL for MoEs is defined as $N_E \sum_{i=1}^{N_E} f_i p_i$, where $N_E$ is the total number of experts, $f_i$ represents the frequency of expert $i$ being selected, and $p_i$ denotes the average gating score of expert $i$. Existing MoE training frameworks usually employ the parallel training strategy so that $f_i$ and the LBL are calculated within a $\textbf{micro-batch}$ and then averaged across parallel groups. In essence, a micro-batch for training billion-scale LLMs normally contains very few sequences. So, the micro-batch LBL is almost at the sequence level, and the router is pushed to distribute the tokens evenly within each sequence. Under this strict constraint, even tokens from a domain-specific sequence ($\textit{e.g.}$, code) are uniformly routed to all experts, thereby inhibiting expert specialization. In this work, we propose calculating LBL using a $\textbf{global-batch}$ to loosen this constraint. Because a global-batch contains much more diverse sequences than a micro-batch, this encourages load balance at the corpus level. Specifically, we introduce an extra communication step to synchronize $f_i$ across micro-batches and then use it to calculate the LBL. Through experiments on training MoEs-based LLMs (up to $\textbf{42.8B}$ total parameters and $\textbf{400B}$ tokens), we surprisingly find that the global-batch LBL strategy yields excellent performance gains in both pre-training perplexity and downstream tasks. Our analysis reveals that the global-batch LBL also greatly improves the domain specialization of MoE experts.
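The mechanism in the abstract can be sketched in a few lines. Below is a minimal, hedged illustration (not the authors' implementation): it computes the standard LBL $N_E \sum_i f_i p_i$ for one micro-batch, and then a global-batch variant in which the selection frequencies $f_i$ are averaged across micro-batches before entering the loss. The plain-Python averaging over a list of micro-batches stands in for the all-reduce communication step the paper describes; function names and the top-1 routing choice are assumptions for illustration.

```python
import numpy as np

def micro_batch_lbl(gates, topk=1):
    """Standard LBL = N_E * sum_i f_i * p_i for one micro-batch.

    gates: (num_tokens, N_E) softmax routing probabilities.
    f_i is the fraction of token-routing slots assigned to expert i (top-k
    selection); p_i is the mean gating score of expert i over the batch.
    """
    n_tokens, n_experts = gates.shape
    top = np.argsort(-gates, axis=1)[:, :topk]          # top-k expert ids per token
    f = np.bincount(top.ravel(), minlength=n_experts) / (n_tokens * topk)
    p = gates.mean(axis=0)
    return n_experts * float(f @ p)

def global_batch_lbl(micro_gates, topk=1):
    """Global-batch LBL sketch: average f_i over all micro-batches (standing in
    for the cross-micro-batch all-reduce), then combine with the mean p_i.

    micro_gates: list of (num_tokens, N_E) gating arrays, one per micro-batch.
    """
    n_experts = micro_gates[0].shape[1]
    fs, ps = [], []
    for g in micro_gates:
        top = np.argsort(-g, axis=1)[:, :topk]
        fs.append(np.bincount(top.ravel(), minlength=n_experts) / (g.shape[0] * topk))
        ps.append(g.mean(axis=0))
    f_global = np.mean(fs, axis=0)   # synchronized selection frequencies
    p_mean = np.mean(ps, axis=0)
    return n_experts * float(f_global @ p_mean)
```

With equal-sized micro-batches, the global version only constrains the *aggregate* frequencies to be uniform, so a code-heavy micro-batch may route most tokens to a few experts as long as other micro-batches compensate, which is exactly the looser, corpus-level balance the abstract argues for.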
Problem

Research questions and friction points this paper is trying to address.

Load Balancing Loss
Mixture of Experts
Domain-specific Tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts (MoE) Models
Improved Load Balancing Loss (LBL)
Global-Batch Training