What is the Alignment Objective of GRPO?

📅 2025-02-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates the preference aggregation performed by Group Policy Optimisation (GRPO) and how it differs fundamentally from the logarithmic pooling of standard RLHF. Method: through formal modelling, the authors characterise GRPO's implicit objective as a nonstandard preference aggregation defined jointly by shift-and-scale-normalised rewards and a reverse KL-divergence penalty against the reference policy. They derive explicit stationary policies for binary questions, groups of size two, and the large-group limit, and analyse how the aggregate preference depends on the regularisation constant and the confidence margin of question answers. Contribution/Results: they establish that substituting the direct KL divergence for the reverse one, or omitting reward scale normalisation, fundamentally alters the aggregation behaviour. This constitutes a systematic theoretical framework for understanding the alignment objective of GRPO as used in advanced AI models, including DeepSeek-R1-Zero and DeepSeekMath, bridging a gap between empirical practice and principled preference learning.
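The shift-and-scale normalisation at the heart of the reward preference model can be sketched as follows. This is a minimal illustration of the normalisation the summary describes (subtract the group mean, divide by the group standard deviation); the function name and the zero-deviation guard are our own, not from the paper.

```python
import statistics

def group_normalised_advantages(rewards):
    """Shift-and-scale normalise a group of sampled rewards, as GRPO's
    reward preference model is described: subtract the group mean and
    divide by the group standard deviation (population form here)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All rewards in the group are equal: no preference signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```

Note that for a group of size two with distinct rewards, the normalised values are always +1 and -1, which matches the paper's observation that groups of size two reduce to pairwise comparison preferences.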

📝 Abstract
In this note, we examine the aggregation of preferences achieved by the Group Policy Optimisation (GRPO) algorithm, a reinforcement learning method used to train advanced artificial intelligence models such as DeepSeek-R1-Zero and DeepSeekMath. The GRPO algorithm trains a policy using a reward preference model, which is computed by sampling a set of outputs for a given context, observing the corresponding rewards, and applying shift-and-scale normalisation to these reward values. Additionally, it incorporates a penalty function to discourage deviations from a reference policy. We present a framework that enables us to characterise the stationary policies of the GRPO algorithm. This analysis reveals that the aggregation of preferences differs fundamentally from standard logarithmic pooling, which is implemented by other approaches such as RLHF. The precise form of preference aggregation arises from the way the reward preference model is defined and from the penalty function, which we show to essentially correspond to the reverse Kullback-Leibler (KL) divergence between the aggregation policy and the reference policy. Interestingly, we demonstrate that for groups of size two, the reward preference model corresponds to pairwise comparison preferences, similar to those in other alignment methods based on pairwise comparison feedback. We provide explicit characterisations of the aggregate preference for binary questions, for groups of size two, and in the limit of large group size. This provides insights into the dependence of the aggregate preference on parameters such as the regularisation constant and the confidence margin of question answers. Finally, we discuss the aggregation of preferences obtained by modifying the GRPO algorithm to use direct KL divergence as the penalty or to use rewards without scale normalisation.
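Based on the abstract, the per-context objective that GRPO implicitly optimises can be sketched as follows. The notation here (\(\tilde{r}\), \(\beta\), \(\mu_G\), \(\sigma_G\)) is ours, chosen for illustration, not taken verbatim from the paper:

```latex
\max_{\pi}\;
\mathbb{E}_{o \sim \pi(\cdot \mid x)}\!\left[\tilde{r}(x, o)\right]
\;-\; \beta\,
\mathrm{KL}\!\left(\pi(\cdot \mid x)\,\middle\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\right),
\qquad
\tilde{r}(x, o) = \frac{r(x, o) - \mu_G}{\sigma_G},
```

where \(\mu_G\) and \(\sigma_G\) are the mean and standard deviation of the rewards over the sampled group, and \(\beta\) is the regularisation constant. For contrast, standard RLHF with this reverse-KL penalty but unnormalised rewards has the logarithmic-pooling stationary policy \(\pi^\ast(o \mid x) \propto \pi_{\mathrm{ref}}(o \mid x)\, e^{r(x,o)/\beta}\); the paper's point is that the group normalisation changes this aggregation fundamentally.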
Problem

Research questions and friction points this paper is trying to address.

Analyzes the preference aggregation implemented by the GRPO algorithm
Compares GRPO's aggregation with the standard logarithmic pooling of RLHF
Explores how the penalty function shapes the aggregate preferences
Innovation

Methods, ideas, or system contributions that make the work stand out.

GRPO trains against a reward preference model built from shift-and-scale-normalised group rewards
The penalty function, shown to correspond to a reverse KL divergence, discourages deviation from the reference policy
The resulting aggregation differs fundamentally from standard logarithmic pooling