🤖 AI Summary
This work addresses load imbalance in Mixture-of-Experts (MoE) training for reinforcement learning (RL), where expert loads fluctuate sharply and hot experts shift frequently across micro-batches. The authors propose ReLibra, a training system that achieves fine-grained load balancing through routing replay: because rollout and training process the same tokens with the same MoE parameters, the token-to-expert routing decisions are known before training starts. ReLibra applies this information at two timescales matched to hierarchical network bandwidths. Between batches, it reorders experts across nodes for batch-level balancing; within a batch, it dynamically replicates experts inside a node to absorb micro-batch-level fluctuations. Unlike existing systems, which predict future expert demand from historical loads, ReLibra balances against routing that is already known. Experiments on diverse MoE LLMs and RL workloads show that it improves training throughput by up to 1.6× over Megatron-LM and by up to 1.2× over EPLB, even when EPLB is given oracle loads, while staying within 6%–10% of the throughput of an idealized balanced baseline.
📝 Abstract
Load imbalance is a long-standing challenge in Mixture-of-Experts (MoE) training and is exacerbated in reinforcement learning (RL) for LLMs, where hot experts can shift frequently across micro-batches. Existing MoE training systems rely on historical loads to predict future expert demand, making them less effective under sharp fluctuations. We propose ReLibra, an MoE RL training system that exploits a unique opportunity in RL's rollout-training workflow, routing replay, to enable fine-grained load balancing at micro-batch granularity. Because rollout and training process the same tokens with the same MoE parameters, the token-to-expert routing decisions are known before training starts. Leveraging this information, ReLibra places two MoE load-balancing mechanisms at inter- and intra-batch timescales, matching their communication patterns to hierarchical network bandwidths. At the inter-batch timescale, ReLibra performs expert reordering to redistribute experts for batch-level cross-node balancing; at the intra-batch timescale, it dynamically performs expert replication within a node to absorb micro-batch-level load fluctuations. Experiments on diverse MoE LLMs and RL workloads show that ReLibra improves training throughput by up to 1.6× over Megatron-LM and by up to 1.2× over EPLB, even when EPLB is given oracle loads. Moreover, ReLibra remains within 6%-10% of the throughput of an idealized balanced baseline.
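
To make the two timescales concrete, the following is a minimal sketch of how routing that is known in advance could drive both steps. It is an illustration under stated assumptions, not ReLibra's actual algorithm: the function names (`expert_loads`, `reorder_experts`, `replicate_within_node`), the greedy heuristics, and the `max_replicas` parameter are all hypothetical, and the sketch ignores communication cost, memory limits, and overlap with computation.

```python
from collections import Counter


def expert_loads(routing, num_experts):
    """Count tokens per expert from known token-to-expert routing.

    `routing` holds one list of expert ids per token (top-k routing).
    """
    counts = Counter(e for token_experts in routing for e in token_experts)
    return [counts.get(e, 0) for e in range(num_experts)]


def reorder_experts(batch_loads, num_nodes):
    """Inter-batch step (assumed heuristic): greedy longest-processing-time
    placement, with an equal number of experts on every node."""
    num_experts = len(batch_loads)
    assert num_experts % num_nodes == 0, "sketch assumes an even split"
    per_node = num_experts // num_nodes
    placement = [[] for _ in range(num_nodes)]
    node_load = [0] * num_nodes
    # Place the heaviest experts first, each on the lightest node with room.
    for e in sorted(range(num_experts), key=lambda e: -batch_loads[e]):
        open_nodes = [n for n in range(num_nodes) if len(placement[n]) < per_node]
        n = min(open_nodes, key=lambda n: node_load[n])
        placement[n].append(e)
        node_load[n] += batch_loads[e]
    return placement, node_load


def replicate_within_node(gpu_experts, micro_loads, max_replicas=2):
    """Intra-batch step (assumed heuristic): copy the hottest expert of the
    busiest GPU onto the idlest GPU of the same node and split its tokens."""
    # shares[g][e] = tokens of expert e that GPU g processes in this micro-batch
    shares = [{e: micro_loads[e] for e in experts} for experts in gpu_experts]
    for _ in range(max_replicas):
        loads = [sum(s.values()) for s in shares]
        hot = max(range(len(shares)), key=loads.__getitem__)
        cold = min(range(len(shares)), key=loads.__getitem__)
        gap = loads[hot] - loads[cold]
        if gap <= 1:
            break  # already balanced to within one token
        expert = max(shares[hot], key=shares[hot].get)
        moved = min(gap // 2, shares[hot][expert])
        shares[hot][expert] -= moved
        shares[cold][expert] = shares[cold].get(expert, 0) + moved
    return shares


# Batch-level loads for 4 experts placed on 2 nodes.
placement, node_load = reorder_experts([10, 2, 8, 4], num_nodes=2)
# One micro-batch on a 2-GPU node where expert 0 is hot.
shares = replicate_within_node([[0], [1]], [9, 1])
```

The split mirrors the hierarchy described in the abstract: moving whole experts between nodes is expensive, so in this sketch it happens once per batch, while replication stays inside a node, where bandwidth is higher, and can be recomputed for every micro-batch.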