Towards Stable and Effective Reinforcement Learning for Mixture-of-Experts

📅 2025-10-27

📈 Citations: 0

✨ Influential: 0

career value

214K/year

🤖 AI Summary

Existing reinforcement learning (RL) research predominantly focuses on dense models, while training Mixture-of-Experts (MoE) architectures in RL remains highly unstable. Method: This paper introduces router-aware importance sampling (RAIS), the first approach to incorporate router information into off-policy importance sampling. RAIS designs a rescaling strategy based on router logits to effectively suppress gradient variance, thereby improving training convergence and stability—without modifying network architecture or the underlying RL training paradigm. Contribution/Results: RAIS is plug-and-play and empirically validated across multiple continuous-control benchmark tasks. It significantly enhances training stability for MoE-based RL agents and yields an average 12.7% improvement in final task performance. These results demonstrate both the necessity and effectiveness of tailoring RL algorithms specifically for MoE structures.

Technology Category

Application Category

📝 Abstract

Recent advances in reinforcement learning (RL) have substantially improved the training of large-scale language models, leading to significant gains in generation quality and reasoning ability. However, most existing research focuses on dense models, while RL training for Mixture-of-Experts (MoE) architectures remains underexplored. To address the instability commonly observed in MoE training, we propose a novel router-aware approach to optimize importance sampling (IS) weights in off-policy RL. Specifically, we design a rescaling strategy guided by router logits, which effectively reduces gradient variance and mitigates training divergence. Experimental results demonstrate that our method significantly improves both the convergence stability and the final performance of MoE models, highlighting the potential of RL algorithmic innovations tailored to MoE architectures and providing a promising direction for efficient training of large-scale expert models.

Problem

Research questions and friction points this paper is trying to address.

Addresses instability in Mixture-of-Experts reinforcement learning training

Proposes router-aware optimization for importance sampling weights

Reduces gradient variance and mitigates training divergence

Innovation

Methods, ideas, or system contributions that make the work stand out.

Router-aware importance sampling optimization for MoE

Rescaling strategy guided by router logits

Reduces gradient variance and training divergence

🔎 Similar Papers

Don't flatten, tokenize! Unlocking the key to SoftMoE's efficacy in deep RL