AI Summary
Large language models still struggle with insufficient accuracy and consistency in complex mathematical reasoning tasks. This work proposes iGRPO, a method based on Group Relative Policy Optimization, which introduces a two-stage iterative optimization mechanism: first sampling multiple reasoning drafts and selecting the one with the highest reward, then performing refined policy updates conditioned on this best draft. Additionally, iGRPO incorporates a dynamic self-feedback mechanism that leverages the model's own high-quality generated drafts to mitigate entropy collapse, thereby enhancing training stability and reasoning performance. Evaluated on the AIME24 and AIME25 benchmarks, the proposed method achieves state-of-the-art results with accuracies of 85.62% and 79.64%, respectively, significantly outperforming existing baselines.
Abstract
Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. Reinforcement Learning (RL) is a framework for aligning these models with task-specific rewards, improving overall quality and reliability. Group Relative Policy Optimization (GRPO) is an efficient, value-function-free alternative to Proximal Policy Optimization (PPO) that leverages group-relative reward normalization. We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts. In Stage 1, iGRPO samples multiple exploratory drafts and selects the highest-reward draft using the same scalar reward signal used for optimization. In Stage 2, it appends this best draft to the original prompt and applies a GRPO-style update on draft-conditioned refinements, training the policy to improve beyond its strongest prior attempt. Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models (e.g., Nemotron-H-8B-Base-8K and DeepSeek-R1 Distilled), validating its effectiveness on diverse reasoning benchmarks. Moreover, applying iGRPO to OpenReasoning-Nemotron-7B trained on AceReason-Math achieves new state-of-the-art results of 85.62% and 79.64% on AIME24 and AIME25, respectively. Ablations further show that the refinement wrapper generalizes beyond GRPO variants, benefits from a generative judge, and alters learning dynamics by delaying entropy collapse. These results underscore the potential of iterative, self-feedback-based RL for advancing verifiable mathematical reasoning.
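The two-stage loop described in the abstract can be sketched in a few lines. This is a minimal toy, not the paper's implementation: the policy is a stub sampler, the reward is an exact-match verifier, and the function names (`sample_drafts`, `igrpo_step`) and the draft-conditioning prompt format are illustrative assumptions. It only shows the control flow of Stage 1 (pick the highest-reward draft) and Stage 2 (group-relative advantage normalization over draft-conditioned refinements); a real trainer would use these advantages in a policy-gradient update.

```python
import random

random.seed(0)

def reward(answer, target):
    """Toy scalar reward: 1.0 on exact match, else 0.0 (stand-in for a verifier)."""
    return 1.0 if answer == target else 0.0

def sample_drafts(policy, prompt, n):
    """Sample n completions from the policy (here, a stub sampler)."""
    return [policy(prompt) for _ in range(n)]

def igrpo_step(policy, prompt, target, group_size=4):
    # Stage 1: sample a group of exploratory drafts and keep the best by reward.
    drafts = sample_drafts(policy, prompt, group_size)
    best = max(drafts, key=lambda d: reward(d, target))

    # Stage 2: condition on the best draft and sample refinements;
    # normalize rewards within the group (GRPO-style advantages).
    conditioned = prompt + "\n[best draft]\n" + best  # hypothetical prompt format
    refinements = sample_drafts(policy, conditioned, group_size)
    rewards = [reward(r, target) for r in refinements]
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    advantages = [(r - mean) / std for r in rewards]
    # A real implementation would now take a policy-gradient step weighted
    # by these advantages; here we just return the intermediate quantities.
    return best, refinements, advantages

# Stub "policy": guesses an answer at random (stands in for an LLM).
policy = lambda prompt: random.choice(["42", "41", "40"])
best, refinements, advantages = igrpo_step(policy, "What is 6*7?", "42")
print(best, advantages)
```

Note that the group-relative advantages are zero-mean by construction, so refinements are rewarded only relative to their siblings in the same group, which is the property that lets GRPO-style methods drop the learned value function.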