Spectral Policy Optimization: Coloring your Incorrect Reasoning in GRPO

📅 2025-05-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
GRPO, a reinforcement learning framework for training large language models, suffers from policy-update stagnation when it encounters all-negative sample groups, i.e., groups in which every generated response is incorrect. Method: The authors propose Diverse-GRPO, first showing theoretically and empirically that all-negative groups still carry exploitable gradient signal. Building on this insight, they introduce a spectral diversity mechanism: AI-feedback-driven response resampling coupled with group-level spectral analysis of response divergence, relaxing the implicit assumption that policy updates require positive samples. Contribution/Results: Integrated into the GRPO framework, Diverse-GRPO sustains policy optimization even from erroneous responses. Evaluated on 7B, 14B, and 32B models across ten reasoning benchmarks in both offline and online settings, it consistently outperforms baselines. The work is presented as the first to systematically uncover and harness the latent learning value of incorrect responses.
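A minimal sketch of why the stall occurs: GRPO computes each response's advantage by normalizing its reward against the group mean and standard deviation, so a group whose rewards are all identical (e.g., all incorrect, reward 0) yields zero advantages for every response and hence no policy gradient. The function name and the binary 0/1 rewards below are illustrative, not the paper's code.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages as used in GRPO: z-score within the group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Mixed group: correct (1) and incorrect (0) responses get nonzero advantages,
# so the policy gradient pushes probability toward the correct ones.
mixed = grpo_advantages([1, 0, 0, 1])

# All-negative group: identical rewards give a zero advantage for every
# response, so the policy update vanishes and learning stalls.
all_neg = grpo_advantages([0, 0, 0, 0])
```

This zero-advantage collapse is exactly the gap Diverse-GRPO targets: it injects diversity into such groups so the within-group normalization no longer degenerates.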

📝 Abstract
Reinforcement learning (RL) has demonstrated significant success in enhancing reasoning capabilities in large language models (LLMs). One of the most widely used RL methods is Group Relative Policy Optimization (GRPO) (Shao et al., 2024), known for its memory efficiency and its success in training DeepSeek-R1 (Guo et al., 2025). However, GRPO stalls when all sampled responses in a group are incorrect -- referred to as an all-negative-sample group -- as it fails to update the policy, hindering learning progress. The contributions of this paper are two-fold. First, we propose a simple yet effective framework that introduces response diversity within all-negative-sample groups in GRPO using AI feedback. We also provide a theoretical analysis, via a stylized model, showing how this diversification improves learning dynamics. Second, we empirically validate our approach, showing improved performance across various model sizes (7B, 14B, 32B) in both offline and online learning settings on 10 benchmarks, including base and distilled variants. Our findings highlight that learning from all-negative-sample groups is not only feasible but beneficial, advancing recent insights from Xiong et al. (2025).
Problem

Research questions and friction points this paper is trying to address.

GRPO stalls when every sampled response in a group is incorrect
Whether diversifying responses within all-negative groups can restore learning
Whether such an approach holds up across model sizes and training settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces response diversity in GRPO using AI feedback
Theoretically analyzes diversification's impact on learning dynamics
Empirically validates improved performance across model sizes
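The group-level spectral analysis is not detailed above. As an illustrative sketch only (the function, embedding input, and entropy-based score are assumptions, not the paper's exact mechanism), one way to model response divergence spectrally is to score a group by the eigenvalue spectrum of the Gram matrix of its response embeddings: a group of near-identical responses concentrates all spectral mass in one direction, while diverse responses spread it out.

```python
import numpy as np

def spectral_divergence(embeddings):
    """Illustrative group-divergence score: entropy of the normalized
    eigenvalue spectrum of the centered Gram matrix of response embeddings.
    Identical responses give zero entropy; diverse responses spread
    spectral mass across eigenvalues, increasing the score."""
    X = np.asarray(embeddings, dtype=float)
    X = X - X.mean(axis=0)                      # center the group
    gram = X @ X.T                              # group-level similarity matrix
    eig = np.clip(np.linalg.eigvalsh(gram), 0.0, None)
    p = eig / (eig.sum() + 1e-12)               # normalized spectrum
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
collapsed = np.tile(rng.normal(size=8), (4, 1))  # 4 identical response embeddings
diverse = rng.normal(size=(4, 8))                # 4 distinct response embeddings

low = spectral_divergence(collapsed)
high = spectral_divergence(diverse)
```

Under this reading, a low score flags an all-negative group whose responses are also redundant, i.e., a group where resampling for diversity would be most useful.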