How Off-Policy Can GRPO Be? Mu-GRPO for Efficient LLM Reinforcement Learning

📅 2026-05-17

🤖 AI Summary
This work addresses the inefficiency of standard GRPO training, which runs in a low-staleness, near-on-policy regime and incurs substantial system overhead. The authors propose Mu-GRPO, a framework built on the finding that GRPO-style algorithms can tolerate far larger rollout staleness than previously assumed. Mu-GRPO organizes training into a small number (e.g., four) of large sequential generation-optimization stages, which induces high rollout staleness but greatly reduces the overhead of switching between rollout and optimization. To keep learning stable on stale data, it combines relaxed clipping, which preserves useful stale-rollout gradients, with a negative-advantage veto, which removes destabilizing post-trigger suffix updates in negative-advantage responses. Evaluated across five language models and multiple mathematical reasoning benchmarks, Mu-GRPO matches or exceeds the performance of standard GRPO while achieving roughly a 2x speedup in wall-clock training time.
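The trade-off between switching overhead and staleness can be made concrete with a small counting sketch. It is an illustration, not the authors' implementation. It assumes that staleness is measured as the number of optimizer updates applied since the rollouts were generated, and that each stage generates all of its rollouts before optimizing on them. The function names and the step count of 400 are hypothetical.

```python
def stage_schedule(total_steps: int, num_stages: int):
    """Return (stage, step, staleness) triples for a staged generate-then-optimize loop.

    Assumption: all rollouts for a stage come from the policy snapshot taken at
    the start of that stage, so staleness grows by one per optimizer step and
    resets to zero when the next stage regenerates rollouts.
    """
    base, extra = divmod(total_steps, num_stages)
    schedule = []
    step = 0
    for stage in range(num_stages):
        size = base + (1 if stage < extra else 0)  # spread any remainder over early stages
        for offset in range(size):
            schedule.append((stage, step, offset))
            step += 1
    return schedule


def summarize(total_steps: int, num_stages: int):
    """Count generation/optimization switches and the worst-case staleness."""
    schedule = stage_schedule(total_steps, num_stages)
    return {
        "switches": num_stages,  # one generation -> optimization hand-off per stage
        "max_staleness": max(s for _, _, s in schedule),
    }


if __name__ == "__main__":
    print(summarize(400, 400))  # fully interleaved: many switches, no staleness
    print(summarize(400, 4))    # four large stages: few switches, high staleness
```

Under these assumptions, four stages cut the number of hand-offs from 400 to 4, at the price of rollouts that are up to 99 updates old by the end of each stage.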
📝 Abstract
Group Relative Policy Optimization (GRPO) has been a key driver of recent progress in reinforcement learning with verifiable rewards (RLVR) for large language models, but it is typically trained in a low-staleness, near-on-policy regime that incurs substantial system overhead. We ask a simple question: How off-policy can GRPO be? We show that GRPO-style algorithms can tolerate substantially larger rollout staleness than previously assumed, and propose Mu-GRPO, an RL training framework that organizes training into a small number (e.g., four) of large sequential generation-optimization stages. This design induces high rollout staleness while greatly reducing rollout-optimization switching overhead. To stabilize learning under stale data, Mu-GRPO combines relaxed clipping, which preserves useful stale-rollout gradients, with negative-advantage veto, which removes destabilizing post-trigger suffix updates in negative-advantage responses. Across five language models and multiple math reasoning benchmarks, Mu-GRPO matches or exceeds the performance of standard GRPO while achieving around 2x speedup in wall-clock training time, establishing a substantially improved performance-efficiency trade-off for LLM reinforcement learning.
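The two stabilizers can be sketched on top of the standard GRPO ingredients (group-standardized rewards and a PPO-style clipped surrogate). This is a minimal sketch, not the paper's method. The abstract does not define the veto trigger or the relaxed clip range, so both are assumptions here: the trigger is taken to be the first token whose importance ratio leaves the band [1/veto_ratio, veto_ratio], the triggering token is dropped together with the suffix, and "relaxed" simply means a wider upper clip bound. All names and numeric values are hypothetical.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: standardize rewards within one prompt's group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_objective(logp_new, logp_old, advantage,
                    clip_low=0.2, clip_high=0.2, veto_ratio=None):
    """Per-token clipped surrogate terms for one response.

    logp_new / logp_old: per-token log-probs under the current policy and the
    (possibly stale) rollout policy. A larger clip_high stands in for relaxed
    clipping. veto_ratio enables the ASSUMED veto rule: in a negative-advantage
    response, the first token whose ratio leaves [1/veto_ratio, veto_ratio]
    and every token after it contribute nothing.
    """
    terms = []
    vetoed = False
    for ln, lo in zip(logp_new, logp_old):
        ratio = math.exp(ln - lo)
        if (veto_ratio is not None and advantage < 0
                and (ratio > veto_ratio or ratio < 1.0 / veto_ratio)):
            vetoed = True  # hypothetical trigger; the paper's rule may differ
        if vetoed:
            terms.append(0.0)
            continue
        clipped = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
        terms.append(min(ratio * advantage, clipped * advantage))
    return terms


if __name__ == "__main__":
    logp_old = [0.0, -math.log(3.0), 0.0]  # middle token has ratio 3 (stale rollout)
    logp_new = [0.0, 0.0, 0.0]
    print(token_objective(logp_new, logp_old, +1.0))                  # tight clip
    print(token_objective(logp_new, logp_old, +1.0, clip_high=4.0))   # relaxed clip
    print(token_objective(logp_new, logp_old, -1.0))                  # no veto
    print(token_objective(logp_new, logp_old, -1.0, veto_ratio=2.0))  # with veto
```

In this toy case, the tight clip caps the ratio-3 token at a constant, so it carries no gradient, while the wider bound keeps it active. With a negative advantage the same token is not capped by the min at all, which is the kind of large update the veto mask zeroes out along with the rest of the response.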
Problem

Research questions and friction points this paper is trying to address.

off-policy
GRPO
LLM reinforcement learning
training efficiency
rollout staleness
Innovation

Methods, ideas, or system contributions that make the work stand out.

off-policy reinforcement learning
GRPO
rollout staleness
relaxed clipping
negative-advantage veto