Your Group-Relative Advantage Is Biased

📅 2026-01-13
📈 Citations: 6
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a systematic bias in advantage estimation within group-based reinforcement learning, where advantages are consistently underestimated for difficult prompts and overestimated for easy ones, disrupting the balance between exploration and exploitation. The study is the first to uncover the mechanism underlying this bias and proposes a novel method, History-Aware Adaptive Difficulty Weighting (HA-DW), which dynamically corrects advantage estimates using training dynamics and an evolving difficulty anchor. Grounded in a theoretical analysis of GRPO and its variants, HA-DW is empirically validated across five mathematical reasoning benchmarks, demonstrating significant performance improvements. These results underscore that correcting advantage bias is crucial for effective Reinforcement Learning from Verifier Rewards (RLVR).
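The estimator under discussion can be sketched as follows. This is a minimal illustration of the standard GRPO-style group-relative advantage under binary verifier rewards; the group size and reward values are illustrative, not taken from the paper.

```python
import statistics

def group_relative_advantage(rewards):
    """Group-relative advantage as used in GRPO-style methods: each
    sampled response's reward is normalized by the mean and standard
    deviation of its own group, avoiding a learned critic."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # All rewards in the group are equal: no learning signal.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

# Binary verifier rewards for a group of 8 sampled responses.
easy = [1, 1, 1, 1, 1, 1, 1, 0]  # easy prompt: most samples correct
hard = [1, 0, 0, 0, 0, 0, 0, 0]  # hard prompt: few samples correct

easy_adv = group_relative_advantage(easy)
hard_adv = group_relative_advantage(hard)
```

Note that within each group the estimates sum to zero by construction, so the signal a response receives depends entirely on the group's empirical success rate rather than on any absolute notion of prompt difficulty.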

๐Ÿ“ Abstract
Reinforcement Learning from Verifier Rewards (RLVR) has emerged as a widely used approach for post-training large language models on reasoning tasks, with group-based methods such as GRPO and its variants gaining broad adoption. These methods rely on group-relative advantage estimation to avoid learned critics, yet its theoretical properties remain poorly understood. In this work, we uncover a fundamental issue of group-based RL: the group-relative advantage estimator is inherently biased relative to the true (expected) advantage. We provide the first theoretical analysis showing that it systematically underestimates advantages for hard prompts and overestimates them for easy prompts, leading to imbalanced exploration and exploitation. To address this issue, we propose History-Aware Adaptive Difficulty Weighting (HA-DW), an adaptive reweighting scheme that adjusts advantage estimates based on an evolving difficulty anchor and training dynamics. Both theoretical analysis and experiments on five mathematical reasoning benchmarks demonstrate that HA-DW consistently improves performance when integrated into GRPO and its variants. Our results suggest that correcting biased advantage estimation is critical for robust and efficient RLVR training.
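The abstract does not spell out HA-DW's exact update rule, so the following is only a plausible sketch under explicit assumptions: that the "evolving difficulty anchor" is an exponential moving average of observed group success rates, and that advantages are scaled up for prompts harder than the anchor and down for easier ones. The function name `hadw_advantages` and the parameters `beta` and `alpha` are hypothetical, not from the paper.

```python
import statistics

def hadw_advantages(rewards, anchor, beta=0.9, alpha=1.0):
    """Hypothetical HA-DW-style correction (a sketch, not the paper's
    actual formulation). Assumes the difficulty anchor is an exponential
    moving average of group success rates, and that group-relative
    advantages are rescaled against that anchor.
    Returns (adjusted advantages, updated anchor)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    base = [0.0] * len(rewards) if std == 0.0 else [(r - mean) / std for r in rewards]
    # Track training dynamics: running estimate of the typical success rate.
    anchor = beta * anchor + (1.0 - beta) * mean
    # A prompt harder than the anchor (success rate below it) gets
    # weight > 1, counteracting the underestimation described in the
    # abstract; an easier prompt gets weight < 1.
    weight = 1.0 + alpha * (anchor - mean)
    return [weight * a for a in base], anchor

# Hard prompt (25% success) while the running anchor sits at 50%.
adv, anchor = hadw_advantages([1, 0, 0, 0], anchor=0.5)
```

The design point this sketch tries to capture is that the correction uses history (the anchor) rather than only the current group, so a single unlucky batch does not fully determine the weighting.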
Problem

Research questions and friction points this paper is trying to address.

group-relative advantage
bias
reinforcement learning
verifier rewards
advantage estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

group-relative advantage
bias correction
adaptive reweighting
reinforcement learning from verifier rewards
difficulty-aware training