AI Summary
Large language models still struggle with insufficient accuracy and consistency in complex mathematical reasoning tasks. This work proposes iGRPO, a method based on Group Relative Policy Optimization, which introduces a two-stage iterative optimization mechanism: first sampling multiple reasoning drafts and selecting the one with the highest reward, then performing refined policy updates conditioned on this best draft. Additionally, iGRPO incorporates a dynamic self-feedback mechanism that leverages the model's own high-quality generated drafts to mitigate entropy collapse, thereby enhancing training stability and reasoning performance. Evaluated on the AIME24 and AIME25 benchmarks, the proposed method achieves state-of-the-art results with accuracies of 85.62% and 79.64%, respectively, significantly outperforming existing baselines.
Abstract
Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. Reinforcement Learning (RL) is a framework for aligning these models with task-specific rewards, improving overall quality and reliability. Group Relative Policy Optimization (GRPO) is an efficient, value-function-free alternative to Proximal Policy Optimization (PPO) that leverages group-relative reward normalization. We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts. In Stage 1, iGRPO samples multiple exploratory drafts and selects the highest-reward draft using the same scalar reward signal used for optimization. In Stage 2, it appends this best draft to the original prompt and applies a GRPO-style update on draft-conditioned refinements, training the policy to improve beyond its strongest prior attempt. Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models (e.g., Nemotron-H-8B-Base-8K and DeepSeek-R1 Distilled), validating its effectiveness on diverse reasoning benchmarks. Moreover, applying iGRPO to OpenReasoning-Nemotron-7B trained on AceReason-Math achieves new state-of-the-art results of 85.62% and 79.64% on AIME24 and AIME25, respectively. Ablations further show that the refinement wrapper generalizes beyond GRPO variants, benefits from a generative judge, and alters learning dynamics by delaying entropy collapse. These results underscore the potential of iterative, self-feedback-based RL for advancing verifiable mathematical reasoning.
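The two-stage loop described in the abstract can be sketched in a few lines. This is a minimal toy, not the paper's implementation: the policy is a stub sampler, the reward is an exact-match verifier, and the function names (`sample_drafts`, `igrpo_step`) and the draft-conditioning prompt format are illustrative assumptions. It only shows the control flow of Stage 1 (pick the highest-reward draft) and Stage 2 (group-relative advantage normalization over draft-conditioned refinements); a real trainer would use these advantages in a policy-gradient update.

```python
import random

random.seed(0)

def reward(answer, target):
    """Toy scalar reward: 1.0 on exact match, else 0.0 (stand-in for a verifier)."""
    return 1.0 if answer == target else 0.0

def sample_drafts(policy, prompt, n):
    """Sample n completions from the policy (here, a stub sampler)."""
    return [policy(prompt) for _ in range(n)]

def igrpo_step(policy, prompt, target, group_size=4):
    # Stage 1: sample a group of exploratory drafts and keep the best by reward.
    drafts = sample_drafts(policy, prompt, group_size)
    best = max(drafts, key=lambda d: reward(d, target))

    # Stage 2: condition on the best draft and sample refinements;
    # normalize rewards within the group (GRPO-style advantages).
    conditioned = prompt + "\n[best draft]\n" + best  # hypothetical prompt format
    refinements = sample_drafts(policy, conditioned, group_size)
    rewards = [reward(r, target) for r in refinements]
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    advantages = [(r - mean) / std for r in rewards]
    # A real implementation would now take a policy-gradient step weighted
    # by these advantages; here we just return the intermediate quantities.
    return best, refinements, advantages

# Stub "policy": guesses an answer at random (stands in for an LLM).
policy = lambda prompt: random.choice(["42", "41", "40"])
best, refinements, advantages = igrpo_step(policy, "What is 6*7?", "42")
print(best, advantages)
```

Note that the group-relative advantages are zero-mean by construction, so refinements are rewarded only relative to their siblings in the same group, which is the property that lets GRPO-style methods drop the learned value function.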