On the Hidden Objective Biases of Group-based Reinforcement Learning

📅 2026-01-08
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work uncovers a structural discrepancy between reward optimization and training objectives in group-based reinforcement learning methods such as GRPO. By establishing a unified surrogate objective framework and integrating optimization dynamics modeling with theoretical analysis, the study identifies three systematic flaws: gradient bias on prefix tokens induced by non-uniform group weighting, AdamW's insensitivity to reward scaling, and momentum-driven excursions beyond clipping boundaries. These findings provide a rigorous theoretical foundation and concrete directions for designing more robust and consistent post-training algorithms.
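The reward-scaling insensitivity can be illustrated with a toy Adam-style update (a minimal sketch, not the paper's analysis; decoupled weight decay and bias-free variants are omitted). Because Adam divides the bias-corrected first moment by the square root of the bias-corrected second moment, multiplying every gradient by a constant, as a uniform reward rescaling would, leaves the parameter update essentially unchanged:

```python
import math

def adam_displacement(grads, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Total parameter displacement after Adam steps on a gradient sequence."""
    m = v = theta = 0.0
    for t, g in enumerate(grads, start=1):
        m = b1 * m + (1 - b1) * g          # first-moment estimate
        v = b2 * v + (1 - b2) * g * g      # second-moment estimate
        m_hat = m / (1 - b1 ** t)          # bias correction
        v_hat = v / (1 - b2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

grads = [0.5, -0.2, 0.3, 0.1]              # stand-in for policy gradients
base = adam_displacement(grads)
scaled = adam_displacement([10.0 * g for g in grads])  # rewards scaled by 10x
# base and scaled agree to within ~eps: the 10x factor cancels in m_hat/sqrt(v_hat)
```

The only residue of the scale factor is the `eps` term in the denominator, which is negligible for gradients of any realistic magnitude.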

๐Ÿ“ Abstract
Group-based reinforcement learning methods, such as Group Relative Policy Optimization (GRPO), are now widely used to post-train large language models. Despite their empirical success, they exhibit structural mismatches between reward optimization and the underlying training objective. In this paper, we present a theoretical analysis of GRPO-style methods by studying them within a unified surrogate formulation. This perspective reveals recurring properties that affect all the methods under analysis: (i) non-uniform group weighting induces systematic gradient biases on shared prefix tokens; (ii) interactions with the AdamW optimizer make training dynamics largely insensitive to reward scaling; and (iii) optimizer momentum can push policy updates beyond the intended clipping region under repeated optimization steps. We believe that these findings highlight fundamental limitations of current approaches and provide principled guidance for the design of future formulations.
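The momentum point (iii) admits a small numerical illustration (a hypothetical one-parameter setup, not the paper's construction). With a PPO-style clipped surrogate and a positive advantage, the gradient vanishes once the probability ratio exceeds 1 + ε, so plain gradient ascent halts just past the boundary; heavy-ball momentum, however, keeps accumulating velocity and carries the ratio well beyond the clipping region:

```python
import math

def train(momentum, steps=50, lr=0.05, eps=0.2, advantage=1.0):
    """One-parameter policy; returns the final probability ratio pi/pi_old."""
    theta = velocity = 0.0
    for _ in range(steps):
        ratio = math.exp(theta)  # probability ratio for a single action
        # clipped-surrogate gradient (positive advantage): zero past 1 + eps
        grad = advantage * ratio if ratio < 1.0 + eps else 0.0
        velocity = momentum * velocity + grad
        theta += lr * velocity
    return math.exp(theta)

no_momentum = train(momentum=0.0)    # halts just past the 1.2 clip boundary
with_momentum = train(momentum=0.9)  # velocity carries the ratio far beyond it
```

With these toy numbers, the momentum-free run stops near a ratio of 1.24, while the momentum run overshoots past 2, despite both receiving zero gradient outside the clipping interval.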
Problem

Research questions and friction points this paper is trying to address.

group-based reinforcement learning
objective bias
reward optimization
training dynamics
policy optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Group-based Reinforcement Learning
Gradient Bias
Reward Scaling
Optimizer Momentum
Surrogate Objective