A Unified Pair-GRPO Family: From Implicit to Explicit Preference Constraints for Stable and General RL Alignment

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses key challenges in preference-based reinforcement learning alignment (unstable policy updates, ambiguous gradient directions, high gradient variance, and poor interpretability) by introducing a unified Pair-GRPO framework with two variants: Soft-Pair-GRPO, which uses implicit preference constraints, and Hard-Pair-GRPO, which uses explicit ones. Soft-Pair-GRPO replaces GRPO's group-normalized scalar rewards with binary pairwise preference rewards while keeping the clipped surrogate objective and KL regularization; a gradient equivalence theorem shows that, to first order, its gradient is a positive scalar multiple of the standard GRPO gradient, which explains its stability. Hard-Pair-GRPO adds explicit local probability constraints and constrained KL-fitting optimization to obtain a deterministic gradient direction and reduce gradient variance. On HH-RLHF, UltraFeedback, and HalfCheetah-v4, the authors report that the Pair-GRPO family consistently outperforms state-of-the-art baselines in alignment quality, human preference win rate, training stability, and generalization.

📝 Abstract

Large language model (LLM) alignment via reinforcement learning from human preferences (RLHF) suffers from unstable policy updates, ambiguous gradient directions, poor interpretability, and high gradient variance in mainstream pairwise preference learning paradigms. To systematically address these limitations, we establish a unified theoretical framework for preference-based RL optimization centered on the Pair-GRPO family, comprising two tightly coupled variants: Soft-Pair-GRPO and Hard-Pair-GRPO. Soft-Pair-GRPO is a minimal modification of Group Relative Policy Optimization (GRPO) that replaces group-normalized scalar rewards with binary pairwise preference rewards, retaining GRPO's clipped surrogate and KL-regularized structure. We prove a critical gradient equivalence theorem: under a first-order Taylor expansion around the current policy, Soft-Pair-GRPO's gradient is a positive scalar multiple of standard GRPO's gradient, explaining its empirical stability despite discarding continuous reward magnitudes. Building on this foundation, we propose Hard-Pair-GRPO, an advanced variant introducing explicit local probability constraints and constrained KL-fitting optimization to further suppress gradient noise and global policy drift. We provide comprehensive theoretical guarantees for both variants, including monotonic policy improvement, deterministic gradient direction, gradient-variance reduction, and dynamic step-size convergence. Extensive experiments on standard LLM alignment benchmarks (HH-RLHF, UltraFeedback) and the MuJoCo continuous control task HalfCheetah-v4 demonstrate that our Pair-GRPO family consistently outperforms state-of-the-art baselines in alignment quality, human preference win rate, training stability, and generalization to general reinforcement learning tasks. Ablation studies validate the critical contributions of each core component.
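
The Soft-Pair-GRPO idea described in the abstract (swap the group-normalized scalar reward for a binary pairwise preference signal, keep the clipped surrogate and KL term) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the win-minus-loss aggregation of pairwise outcomes, the choice to derive preferences from scalar scores, and the values of `eps` and `beta` are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np

def group_normalized_advantages(rewards):
    """GRPO-style signal: z-score scalar rewards within one sampled group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def pairwise_preference_advantages(rewards):
    """Hypothetical binary pairwise signal (assumed aggregation).

    Each response is compared with every other response in the group:
    +1 for a win, -1 for a loss, 0 for a tie. The outcomes are averaged
    over the G - 1 opponents, so reward magnitudes are discarded.
    """
    r = np.asarray(rewards, dtype=float)
    outcomes = np.sign(r[:, None] - r[None, :])
    return outcomes.sum(axis=1) / (len(r) - 1)

def clipped_kl_objective(logp_new, logp_old, logp_ref, adv, eps=0.2, beta=0.04):
    """Clipped surrogate minus a KL penalty toward a reference policy.

    eps and beta are illustrative values, not taken from the paper.
    """
    ratio = np.exp(logp_new - logp_old)
    surrogate = np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)
    # Nonnegative per-sample KL estimate: x - log(x) - 1 with x = pi_ref / pi_new.
    log_x = logp_ref - logp_new
    kl = np.exp(log_x) - log_x - 1.0
    return float(np.mean(surrogate - beta * kl))

# One group of four sampled responses with illustrative scalar scores.
scores = [0.1, 0.9, 0.5, 0.3]
adv_grpo = group_normalized_advantages(scores)
adv_pair = pairwise_preference_advantages(scores)

# Log-probabilities are placeholders; new == old == ref gives ratio 1 and KL 0.
logp = np.log(np.full(4, 0.25))
objective = clipped_kl_objective(logp, logp, logp, adv_pair)
```

In this sketch the pairwise signal depends only on the ordering of the responses within the group, whereas the group-normalized signal also depends on how far apart the scores are. Either advantage vector can be passed to the same clipped objective, which is the sense in which the modification is minimal.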

Problem

Research questions and friction points this paper is trying to address.

RLHF

preference learning

policy stability

gradient variance

LLM alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Pair-GRPO

preference-based RL

gradient stability