On the Hidden Objective Biases of Group-based Reinforcement Learning

📅 2026-01-08
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work uncovers a structural discrepancy between reward optimization and training objectives in group-based reinforcement learning methods such as GRPO. By establishing a unified surrogate objective framework and integrating optimization dynamics modeling with theoretical analysis, the study identifies three systematic flaws: gradient bias on prefix tokens induced by non-uniform group weighting, AdamW's insensitivity to reward scaling, and momentum-driven excursions beyond clipping boundaries. These findings provide a rigorous theoretical foundation and concrete directions for designing more robust and consistent post-training algorithms.
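The reward-scaling insensitivity can be illustrated with a toy Adam-style update (a minimal sketch, not the paper's analysis; decoupled weight decay and bias-free variants are omitted). Because Adam divides the bias-corrected first moment by the square root of the bias-corrected second moment, multiplying every gradient by a constant, as a uniform reward rescaling would, leaves the parameter update essentially unchanged:

```python
import math

def adam_displacement(grads, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Total parameter displacement after Adam steps on a gradient sequence."""
    m = v = theta = 0.0
    for t, g in enumerate(grads, start=1):
        m = b1 * m + (1 - b1) * g          # first-moment estimate
        v = b2 * v + (1 - b2) * g * g      # second-moment estimate
        m_hat = m / (1 - b1 ** t)          # bias correction
        v_hat = v / (1 - b2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

grads = [0.5, -0.2, 0.3, 0.1]              # stand-in for policy gradients
base = adam_displacement(grads)
scaled = adam_displacement([10.0 * g for g in grads])  # rewards scaled by 10x
# base and scaled agree to within ~eps: the 10x factor cancels in m_hat/sqrt(v_hat)
```

The only residue of the scale factor is the `eps` term in the denominator, which is negligible for gradients of any realistic magnitude.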

๐Ÿ“ Abstract
Group-based reinforcement learning methods, such as Group Relative Policy Optimization (GRPO), are now widely used to post-train large language models. Despite their empirical success, they exhibit structural mismatches between reward optimization and the underlying training objective. In this paper, we present a theoretical analysis of GRPO-style methods by studying them within a unified surrogate formulation. This perspective reveals recurring properties that affect all the methods under analysis: (i) non-uniform group weighting induces systematic gradient biases on shared prefix tokens; (ii) interactions with the AdamW optimizer make training dynamics largely insensitive to reward scaling; and (iii) optimizer momentum can push policy updates beyond the intended clipping region under repeated optimization steps. We believe that these findings highlight fundamental limitations of current approaches and provide principled guidance for the design of future formulations.
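The momentum point (iii) admits a small numerical illustration (a hypothetical one-parameter setup, not the paper's construction). With a PPO-style clipped surrogate and a positive advantage, the gradient vanishes once the probability ratio exceeds 1 + ε, so plain gradient ascent halts just past the boundary; heavy-ball momentum, however, keeps accumulating velocity and carries the ratio well beyond the clipping region:

```python
import math

def train(momentum, steps=50, lr=0.05, eps=0.2, advantage=1.0):
    """One-parameter policy; returns the final probability ratio pi/pi_old."""
    theta = velocity = 0.0
    for _ in range(steps):
        ratio = math.exp(theta)  # probability ratio for a single action
        # clipped-surrogate gradient (positive advantage): zero past 1 + eps
        grad = advantage * ratio if ratio < 1.0 + eps else 0.0
        velocity = momentum * velocity + grad
        theta += lr * velocity
    return math.exp(theta)

no_momentum = train(momentum=0.0)    # halts just past the 1.2 clip boundary
with_momentum = train(momentum=0.9)  # velocity carries the ratio far beyond it
```

With these toy numbers, the momentum-free run stops near a ratio of 1.24, while the momentum run overshoots past 2, despite both receiving zero gradient outside the clipping interval.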
Problem

Research questions and friction points this paper is trying to address.

group-based reinforcement learning
objective bias
reward optimization
training dynamics
policy optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Group-based Reinforcement Learning
Gradient Bias
Reward Scaling
Optimizer Momentum
Surrogate Objective