Quantile Advantage Estimation for Entropy-Safe Reasoning

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
In Reinforcement Learning with Verifiable Rewards (RLVR), mean baselines induce policy entropy oscillations under sparse rewards, manifesting as catastrophic collapse or unbounded explosion. Method: the paper proposes the Group K-Quantile Baseline, which it proves guarantees two-sided (bidirectional) entropy safety. The method eliminates explicit value-function estimation and combines first-order softmax policy updates with a response-level two-regime gate, enabling sparse advantage allocation and stable policy optimization on Qwen3-8B/14B-Base. Contribution/Results: the analysis identifies the mean baseline as the root cause of entropy instability and shows that quantile baselines act as an automatic gating mechanism: encouraging successful responses on hard problems while suppressing erroneous ones on easy problems. On the AIME 2024/2025 and AMC 2023 benchmarks, the method consistently improves pass@1, with roughly 80% of responses assigned zero advantage, suppressing entropy fluctuations and improving both reasoning stability and computational efficiency.

📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) strengthens LLM reasoning, but training often oscillates between entropy collapse and entropy explosion. We trace both hazards to the mean baseline used in value-free RL (e.g., GRPO and DAPO), which improperly penalizes negative-advantage samples under reward outliers. We propose Quantile Advantage Estimation (QAE), replacing the mean with a group-wise K-quantile baseline. QAE induces a response-level, two-regime gate: on hard queries (p <= 1 - K) it reinforces rare successes, while on easy queries (p > 1 - K) it targets remaining failures. Under first-order softmax updates, we prove two-sided entropy safety, giving lower and upper bounds on one-step entropy change that curb explosion and prevent collapse. Empirically, this minimal modification stabilizes entropy, sparsifies credit assignment (with tuned K, roughly 80% of responses receive zero advantage), and yields sustained pass@1 gains on Qwen3-8B/14B-Base across AIME 2024/2025 and AMC 2023. These results identify baseline design, rather than token-level heuristics, as the primary mechanism for scaling RLVR.
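The group-wise K-quantile baseline in the abstract can be sketched in a few lines. This is an illustrative reconstruction for binary verifiable rewards, not the authors' code: the function name `qae_advantages`, the default K=0.2, and the use of NumPy's `method="lower"` quantile (so the baseline is always an actual group reward, avoiding interpolated non-binary values) are all assumptions.

```python
import numpy as np

def qae_advantages(rewards, k=0.2):
    """Advantage = reward minus the group's K-quantile baseline.

    With binary rewards, the baseline is 0 when the success rate p
    satisfies p <= 1 - K (hard query) and 1 when p > 1 - K (easy query),
    yielding the two-regime gate: on hard queries rare successes get +1
    while failures get 0; on easy queries failures get -1 while
    successes get 0. Most responses thus receive zero advantage.
    """
    rewards = np.asarray(rewards, dtype=float)
    # "lower" picks an order statistic rather than interpolating,
    # keeping the baseline equal to one of the observed rewards.
    baseline = np.quantile(rewards, k, method="lower")
    return rewards - baseline
```

On a hard group (2 successes out of 10) the baseline is 0, so only the two successes carry nonzero advantage; on an easy group (9 successes out of 10) the baseline is 1, so only the single failure does. In both cases 80–90% of responses get zero advantage, matching the sparsity the abstract reports.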
Problem

Research questions and friction points this paper is trying to address.

Addresses entropy collapse and explosion in RLVR training
Proposes quantile baseline to replace problematic mean estimation
Ensures entropy safety and stabilizes credit assignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Quantile Advantage Estimation replaces mean baseline
Two-regime gate reinforces rare successes and targets failures
Two-sided entropy safety prevents entropy collapse and explosion
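To show how such sparse advantages enter a first-order softmax update, here is a minimal REINFORCE-style surrogate over a group of responses. This is a hedged sketch of a generic value-free policy-gradient objective, not the paper's training loop; the function name `surrogate_loss` is an assumption.

```python
import numpy as np

def surrogate_loss(logprobs, advantages):
    """Response-level policy-gradient surrogate.

    Responses with zero advantage (the majority under a tuned quantile
    baseline) contribute nothing to the gradient, which is the
    sparse-credit-assignment effect described above.
    """
    logprobs = np.asarray(logprobs, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    return -float(np.mean(advantages * logprobs))
```

A group where every advantage is zero produces a zero loss (and hence a zero gradient), so only the gated responses drive the update.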