Your Group-Relative Advantage Is Biased

📅 2026-01-13
📈 Citations: 6
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a systematic bias in advantage estimation within group-based reinforcement learning, where advantages are consistently underestimated for difficult prompts and overestimated for easy ones, disrupting the balance between exploration and exploitation. The study is the first to uncover the mechanism underlying this bias and proposes a novel method, History-Aware Adaptive Difficulty Weighting (HA-DW), which dynamically corrects advantage estimates using training dynamics and an evolving difficulty anchor. Grounded in a theoretical analysis of GRPO and its variants, HA-DW is empirically validated across five mathematical reasoning benchmarks, demonstrating significant performance improvements. These results underscore that correcting advantage bias is crucial for effective Reinforcement Learning from Verifier Rewards (RLVR).
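The estimator under discussion can be sketched as follows. This is a minimal illustration of the standard GRPO-style group-relative advantage under binary verifier rewards; the group size and reward values are illustrative, not taken from the paper.

```python
import statistics

def group_relative_advantage(rewards):
    """Group-relative advantage as used in GRPO-style methods: each
    sampled response's reward is normalized by the mean and standard
    deviation of its own group, avoiding a learned critic."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # All rewards in the group are equal: no learning signal.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

# Binary verifier rewards for a group of 8 sampled responses.
easy = [1, 1, 1, 1, 1, 1, 1, 0]  # easy prompt: most samples correct
hard = [1, 0, 0, 0, 0, 0, 0, 0]  # hard prompt: few samples correct

easy_adv = group_relative_advantage(easy)
hard_adv = group_relative_advantage(hard)
```

Note that within each group the estimates sum to zero by construction, so the signal a response receives depends entirely on the group's empirical success rate rather than on any absolute notion of prompt difficulty.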

๐Ÿ“ Abstract
Reinforcement Learning from Verifier Rewards (RLVR) has emerged as a widely used approach for post-training large language models on reasoning tasks, with group-based methods such as GRPO and its variants gaining broad adoption. These methods rely on group-relative advantage estimation to avoid learned critics, yet its theoretical properties remain poorly understood. In this work, we uncover a fundamental issue of group-based RL: the group-relative advantage estimator is inherently biased relative to the true (expected) advantage. We provide the first theoretical analysis showing that it systematically underestimates advantages for hard prompts and overestimates them for easy prompts, leading to imbalanced exploration and exploitation. To address this issue, we propose History-Aware Adaptive Difficulty Weighting (HA-DW), an adaptive reweighting scheme that adjusts advantage estimates based on an evolving difficulty anchor and training dynamics. Both theoretical analysis and experiments on five mathematical reasoning benchmarks demonstrate that HA-DW consistently improves performance when integrated into GRPO and its variants. Our results suggest that correcting biased advantage estimation is critical for robust and efficient RLVR training.
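The abstract does not spell out HA-DW's exact update rule, so the following is only a plausible sketch under explicit assumptions: that the "evolving difficulty anchor" is an exponential moving average of observed group success rates, and that advantages are scaled up for prompts harder than the anchor and down for easier ones. The function name `hadw_advantages` and the parameters `beta` and `alpha` are hypothetical, not from the paper.

```python
import statistics

def hadw_advantages(rewards, anchor, beta=0.9, alpha=1.0):
    """Hypothetical HA-DW-style correction (a sketch, not the paper's
    actual formulation). Assumes the difficulty anchor is an exponential
    moving average of group success rates, and that group-relative
    advantages are rescaled against that anchor.
    Returns (adjusted advantages, updated anchor)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    base = [0.0] * len(rewards) if std == 0.0 else [(r - mean) / std for r in rewards]
    # Track training dynamics: running estimate of the typical success rate.
    anchor = beta * anchor + (1.0 - beta) * mean
    # A prompt harder than the anchor (success rate below it) gets
    # weight > 1, counteracting the underestimation described in the
    # abstract; an easier prompt gets weight < 1.
    weight = 1.0 + alpha * (anchor - mean)
    return [weight * a for a in base], anchor

# Hard prompt (25% success) while the running anchor sits at 50%.
adv, anchor = hadw_advantages([1, 0, 0, 0], anchor=0.5)
```

The design point this sketch tries to capture is that the correction uses history (the anchor) rather than only the current group, so a single unlucky batch does not fully determine the weighting.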
Problem

Research questions and friction points this paper is trying to address.

group-relative advantage
bias
reinforcement learning
verifier rewards
advantage estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

group-relative advantage
bias correction
adaptive reweighting
reinforcement learning from verifier rewards
difficulty-aware training