AI Summary
This work addresses key limitations of Group Relative Policy Optimization (GRPO) in complex reasoning tasks: the length bias induced by sequence-level advantage normalization, insufficient penalization of low-quality trajectories, and the underutilization of fine-grained preference signals embedded in intra-group reward rankings. To overcome these issues, the authors propose an implicit DPO-style contrastive regularizer that requires no additional annotations and, for the first time, converts intra-group trajectory reward orderings into implicit preference signals. Integrated into the GRPO framework, this regularizer supplies denser supervision constraints, improving both the discriminability and the supervision density of policy updates. Empirical results show consistent improvements over standard GRPO across multiple mathematical reasoning benchmarks, with particularly notable gains on samples where the original method fails.
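For context, the length-bias claim refers to the standard GRPO objective, sketched below in a common formulation from the GRPO literature rather than taken from this paper; the notation (prompt $q$, rollouts $o_1,\dots,o_G$, rewards $r_i$, token-level importance ratio $\rho_{i,t}$, clipping range $\epsilon$) is assumed. Each trajectory's advantage is normalized against its own group, and the per-token surrogate is averaged over the response length $|o_i|$, which is the term usually identified as the source of response-level length bias.

$$
\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_j\}_{j=1}^{G})}{\operatorname{std}(\{r_j\}_{j=1}^{G})},
\qquad
\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}_{q,\{o_i\}}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}
\min\!\Big(\rho_{i,t}\,\hat{A}_i,\ \operatorname{clip}\big(\rho_{i,t},\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_i\Big)\right]
$$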
Abstract
Reinforcement learning has become the primary paradigm for aligning large language models (LLMs) on complex reasoning tasks, with group relative policy optimization (GRPO) widely used in large-scale post-training. However, GRPO faces structural limitations in reasoning-heavy settings: sequence-level advantage normalization introduces systematic length bias, penalties for low-quality trajectories are diluted, and the scalar objective discards rich pairwise preference information embedded in within-group reward rankings. As a result, valuable supervision from costly rollouts remains underutilized. We propose AMIR-GRPO, which augments GRPO with an implicit DPO-style contrastive regularizer constructed directly from intra-group reward rankings, requiring no additional annotations. This mechanism amplifies suppression of low-reward trajectories, attenuates response-level length bias, and transforms each rollout group into a denser set of supervision constraints. Across multiple mathematical reasoning benchmarks, AMIR-GRPO consistently outperforms strong GRPO baselines, yields clearer separation between correct and incorrect reasoning chains, and delivers broader coverage gains beyond the subset of instances solved by standard GRPO.
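To make the mechanism concrete, one plausible instantiation of an implicit DPO-style regularizer over intra-group reward rankings is sketched below; it is not taken from the paper, and the pair set $\mathcal{P}$, temperature $\beta$, reference policy $\pi_{\text{ref}}$, and weight $\lambda$ are all assumptions. Each ordered pair of rollouts with distinct rewards contributes a contrastive constraint that pushes the policy toward the higher-reward trajectory:

$$
\mathcal{L}_{\text{rank}}(\theta) = -\frac{1}{|\mathcal{P}|}\sum_{(i,j)\in\mathcal{P}}
\log\sigma\!\left(\beta\left[\log\frac{\pi_\theta(o_i\mid q)}{\pi_{\text{ref}}(o_i\mid q)}
- \log\frac{\pi_\theta(o_j\mid q)}{\pi_{\text{ref}}(o_j\mid q)}\right]\right),
\qquad \mathcal{P}=\{(i,j): r_i > r_j\},
$$

$$
\mathcal{J}_{\text{total}}(\theta) = \mathcal{J}_{\text{GRPO}}(\theta) - \lambda\,\mathcal{L}_{\text{rank}}(\theta).
$$

Under this reading, a group of $G$ rollouts yields up to $G(G-1)/2$ pairwise constraints instead of $G$ scalar advantages, which is one way to interpret the "denser set of supervision constraints" described in the abstract.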