🤖 AI Summary
In RLVR (Reinforcement Learning with Verifiable Rewards), mean baselines destabilize policy entropy under sparse reward settings, so training oscillates between catastrophic entropy collapse and unbounded entropy explosion.
Method: We propose Quantile Advantage Estimation (QAE), which replaces the group mean with a group-wise K-quantile baseline; it is the first baseline theoretically proven to guarantee two-sided entropy safety. The method remains value-free (no explicit value function estimation) and, under first-order softmax policy updates, acts as a response-level two-regime gate, enabling sparse advantage allocation and stable policy optimization on Qwen3-8B/14B.
Contribution/Results: We identify the mean baseline as the root cause of entropy instability and show that the quantile baseline induces an automatic gating mechanism: it reinforces rare successful responses on hard problems while suppressing remaining erroneous ones on easy problems. On the AIME 2024/2025 and AMC 2023 benchmarks, QAE consistently improves pass@1, with roughly 80% of responses assigned zero advantage, which suppresses entropy fluctuations and improves both reasoning stability and computational efficiency.
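To make the gating explicit, here is a brief formal sketch, assuming binary 0/1 verifiable rewards over a group of G responses; the symbols b (baseline), A_i (advantage), p (group success rate), and Q_K (empirical K-quantile) are our notation for the quantities described above, not necessarily the paper's:

```latex
% Sketch of the group K-quantile baseline and its two-regime gate,
% assuming binary rewards r_i \in \{0,1\} over a group of G responses.
b = Q_K\big(\{r_j\}_{j=1}^{G}\big), \qquad
A_i = r_i - b, \qquad
p = \frac{1}{G}\sum_{j=1}^{G} r_j
% With binary rewards, the quantile baseline reduces to a gate:
b =
\begin{cases}
0, & p \le 1 - K \quad \text{(hard query: rare successes get } A_i = 1) \\
1, & p > 1 - K   \quad \text{(easy query: remaining failures get } A_i = -1)
\end{cases}
% All other responses receive A_i = 0, which is the source of the
% sparse credit assignment reported above.
```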
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) strengthens LLM reasoning, but training often oscillates between *entropy collapse* and *entropy explosion*. We trace both hazards to the mean baseline used in value-free RL (e.g., GRPO and DAPO), which improperly penalizes negative-advantage samples under reward outliers. We propose *Quantile Advantage Estimation* (QAE), replacing the mean with a group-wise K-quantile baseline. QAE induces a response-level, two-regime gate: on hard queries (p ≤ 1 − K) it reinforces rare successes, while on easy queries (p > 1 − K) it targets remaining failures. Under first-order softmax updates, we prove *two-sided entropy safety*, giving lower and upper bounds on the one-step entropy change that curb explosion and prevent collapse. Empirically, this minimal modification stabilizes entropy, sparsifies credit assignment (with tuned K, roughly 80% of responses receive zero advantage), and yields sustained pass@1 gains on Qwen3-8B/14B-Base across AIME 2024/2025 and AMC 2023. These results identify *baseline design*, rather than token-level heuristics, as the primary mechanism for scaling RLVR.
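As a concrete illustration, below is a minimal Python sketch of a group-wise K-quantile baseline under the binary-reward assumption above; the function name `qae_advantages`, the default K = 0.5, and the `method="lower"` quantile convention are illustrative choices, not the paper's reference implementation:

```python
import numpy as np

def qae_advantages(rewards, k=0.5):
    """Sketch of Quantile Advantage Estimation (QAE) for one query's group.

    rewards: binary verifiable rewards (1 = verified correct) for the G
    sampled responses to a single query. The group mean baseline of
    GRPO/DAPO is replaced by the empirical K-quantile; the tie-breaking
    convention (method="lower") is an assumption made for this sketch.
    """
    rewards = np.asarray(rewards, dtype=float)
    baseline = np.quantile(rewards, k, method="lower")  # K-quantile, not mean
    return rewards - baseline

# Two-regime gate with K = 0.5 and binary rewards:
# hard query (success rate p <= 1 - K): baseline = 0, rare success reinforced
print(qae_advantages([1, 0, 0, 0]))   # -> [ 1.  0.  0.  0.]
# easy query (p > 1 - K): baseline = 1, remaining failure suppressed
print(qae_advantages([1, 1, 1, 0]))   # -> [ 0.  0.  0. -1.]
```

Because the baseline snaps to 0 or 1 on binary rewards, every response on the majority side of the quantile receives exactly zero advantage, which illustrates the sparsification behind the roughly-80%-zero-advantage figure.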