Stabilizing Reinforcement Learning with LLMs: Formulation and Practices

📅 2025-12-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge in large language model (LLM) reinforcement learning where token-level surrogate objectives fail to faithfully optimize the sequence-level true reward. The authors propose a first-order theoretical framework that formally characterizes the conditions under which token-level optimization is valid, and that explains how training-inference mismatch and policy staleness undermine training stability. Methodologically, they combine importance sampling correction, clipping, and Routing Replay to enable stable on-policy and off-policy updates in Mixture-of-Experts (MoE) architectures. Extensive experiments on a 30B-parameter MoE model, totaling hundreds of thousands of GPU hours, show that the approach substantially improves training stability, remaining crash-free throughout, and achieves sustained high-performance convergence.

📝 Abstract
This paper proposes a novel formulation for reinforcement learning (RL) with large language models, explaining why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. Specifically, through a first-order approximation, we show that this surrogate becomes increasingly valid only when both the training-inference discrepancy and policy staleness are minimized. This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training, including importance sampling correction, clipping, and particularly Routing Replay for Mixture-of-Experts (MoE) models. Through extensive experiments with a 30B MoE model totaling hundreds of thousands of GPU hours, we show that for on-policy training, the basic policy gradient algorithm with importance sampling correction achieves the highest training stability. When off-policy updates are introduced to accelerate convergence, combining clipping and Routing Replay becomes essential to mitigate the instability caused by policy staleness. Notably, once training is stabilized, prolonged optimization consistently yields comparable final performance regardless of cold-start initialization. We hope that the shared insights and the developed recipes for stable RL training will facilitate future research.
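The abstract describes correcting a token-level surrogate objective with importance sampling and clipping, in the style of PPO's pessimistic surrogate. The sketch below illustrates that mechanism under stated assumptions: function and variable names are illustrative, not the paper's implementation, and the per-token advantage here stands in for a broadcast sequence-level reward signal.

```python
import math

def clipped_token_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Token-level surrogate loss with importance-sampling correction.

    logp_new:   log-probs of sampled tokens under the current policy
    logp_old:   log-probs under the (possibly stale) behavior policy
    advantages: per-token advantage estimates
    Returns the negated (to-be-minimized) PPO-style clipped surrogate.
    """
    losses = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)            # importance weight
        clipped = max(min(ratio, 1 + eps), 1 - eps)  # clip to [1-eps, 1+eps]
        # Pessimistic surrogate: take the worse of the two estimates.
        losses.append(-min(ratio * adv, clipped * adv))
    return sum(losses) / len(losses)
```

When rollout and training policies match exactly (`logp_new == logp_old`), every ratio is 1 and the loss reduces to the plain REINFORCE estimate; the clipping only engages as the policies drift apart, which is the staleness regime the paper analyzes.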
Problem

Research questions and friction points this paper is trying to address.

Under what conditions can the sequence-level true reward be optimized via a token-level surrogate objective in RL with LLMs?
How do training-inference discrepancy and policy staleness destabilize training, and how can they be minimized?
Why do importance sampling correction, clipping, and Routing Replay stabilize RL training in practice?
Innovation

Methods, ideas, or system contributions that make the work stand out.

First-order analysis of when a token-level surrogate faithfully optimizes the sequence-level reward
Importance sampling correction and clipping to counter training-inference mismatch and policy staleness
Routing Replay to stabilize expert routing in MoE models across rollout and training