Reinforcement Learning with Verifiable Rewards: GRPO's Effective Loss, Dynamics, and Success Amplification

📅 2025-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper studies policy optimization for large language models under verifiable rewards (e.g., binary success/failure signals), focusing on GRPO. It shows that GRPO with verifiable rewards can be formulated as a KL-regularized contrastive loss, which yields a closed-form optimal policy at each iteration in terms of the binary reward and statistics of the old and reference policies. From this, the authors derive an explicit recurrence for the success probability $p_n$ across iterations and prove that it converges to a fixed point $p^*$ strictly larger than the initial value $p_0$. The analysis thus demonstrates that GRPO systematically amplifies the policy's probability of success, while making its effective loss and iteration dynamics interpretable.

📝 Abstract
Group Relative Policy Optimization (GRPO) was introduced and used successfully to train DeepSeek R1 models for promoting reasoning capabilities of LLMs using verifiable or binary rewards. We show in this paper that GRPO with verifiable rewards can be written as a Kullback-Leibler ($\mathsf{KL}$) regularized contrastive loss, where the contrastive samples are synthetic data sampled from the old policy. The optimal GRPO policy $\pi_{n}$ can be expressed explicitly in terms of the binary reward, as well as the first and second order statistics of the old policy ($\pi_{n-1}$) and the reference policy $\pi_0$. Iterating this scheme, we obtain a sequence of policies $\pi_{n}$ for which we can quantify the probability of success $p_n$. We show that the probability of success of the policy satisfies a recurrence that converges to a fixed point of a function that depends on the initial probability of success $p_0$ and the regularization parameter $\eta$ of the $\mathsf{KL}$ regularizer. We show that the fixed point $p^*$ is guaranteed to be larger than $p_0$, thereby demonstrating that GRPO effectively amplifies the probability of success of the policy.
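The iteration described in the abstract can be illustrated with a toy sketch. The snippet below iterates the standard exponential-tilting update for a KL-regularized binary reward, where the optimal policy reweights the old policy by $e^{r/\eta}$, so the success mass is scaled by $e^{1/\eta}$. This is a simplification, not the paper's exact recurrence (which also involves the reference policy $\pi_0$ and the reward's first and second order statistics); it only illustrates how iterated KL-tilted updates amplify the success probability.

```python
import math

def tilt_update(p, eta):
    """One toy KL-tilted update for a binary reward.

    The optimal KL-regularized policy reweights success mass by exp(1/eta),
    so the new success probability is the renormalized tilted mass.
    NOTE: this is a simplified stand-in for the paper's recurrence, which
    additionally depends on the reference policy and reward statistics.
    """
    w = p * math.exp(1.0 / eta)
    return w / (w + (1.0 - p))

def iterate_success(p0, eta, n_steps):
    """Iterate the toy update, returning the sequence p_0, p_1, ..., p_n."""
    ps = [p0]
    for _ in range(n_steps):
        ps.append(tilt_update(ps[-1], eta))
    return ps

# Starting from a modest success rate, each iteration strictly increases p_n.
ps = iterate_success(p0=0.3, eta=2.0, n_steps=10)
```

In this simplified map the sequence is strictly increasing for any `eta > 0`; the paper's actual recurrence additionally anchors the iterates to the reference policy, which is what produces a nontrivial fixed point $p^* \in (p_0, 1]$.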
Problem

Research questions and friction points this paper is trying to address.

Enhancing reasoning in LLMs trained with verifiable (binary) rewards.
Understanding GRPO's effective loss: can it be written as a KL-regularized contrastive objective?
Quantifying whether, and by how much, iterated GRPO updates amplify the policy's success probability.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reformulation of GRPO with verifiable rewards as a KL-regularized contrastive loss with a closed-form optimal policy.
Explicit recurrence for the success probability across iterations, with a convergence guarantee to a fixed point.
Proof that the fixed point strictly exceeds the initial success probability, i.e., GRPO amplifies success.