GOPO: Policy Optimization using Ranked Rewards

📅 2026-02-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the mismatch between existing policy optimization methods and preference-based reward modeling in non-verifiable reward settings, such as summarization and instruction following, where reliance on absolute reward values limits performance. To resolve this, the authors propose Group Ordinal Policy Optimization (GOPO), which directly incorporates the ordinal ranking of rewards into policy training. By applying an ordinal transformation, GOPO aligns the objectives of preference learning and policy optimization while avoiding dependence on unreliable reward magnitudes. Experiments demonstrate that GOPO improves training stability and sample efficiency across diverse tasks and model scales: compared to GRPO, it yields higher training and validation reward trajectories, achieves higher LLM-as-judge scores at most intermediate steps, and reaches comparable policy quality in fewer training iterations.

📝 Abstract
Standard reinforcement learning from human feedback (RLHF) trains a reward model on pairwise preference data and then uses it for policy optimization. However, while reward models are optimized to capture relative preferences, existing policy optimization techniques rely on absolute reward magnitudes during training. In settings where the rewards are non-verifiable, such as summarization, instruction following, and chat completion, this misalignment often leads to suboptimal performance. We introduce Group Ordinal Policy Optimization (GOPO), a policy optimization method that uses only the ranking of the rewards and discards their magnitudes. Our rank-based transformation of rewards provides several gains over Group Relative Policy Optimization (GRPO) in settings with non-verifiable rewards: (1) consistently higher training/validation reward trajectories, (2) improved LLM-as-judge evaluations across most intermediate training steps, and (3) reaching a policy of comparable quality in substantially fewer training steps than GRPO. We demonstrate consistent improvements across a range of tasks and model sizes.
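The core idea contrasted here can be sketched as follows: GRPO standardizes raw reward magnitudes within a group of sampled responses, whereas a rank-based approach first replaces each reward with its rank, so only the ordering matters. This is a minimal illustrative sketch, not the paper's exact algorithm; the specific ordinal transformation GOPO uses may differ.

```python
def rank_transform(rewards):
    # Replace each reward with its rank in the group (0 = lowest),
    # discarding the reward magnitudes entirely.
    order = sorted(range(len(rewards)), key=lambda i: rewards[i])
    ranks = [0.0] * len(rewards)
    for rank, i in enumerate(order):
        ranks[i] = float(rank)
    return ranks

def group_advantages(values):
    # Standardize within the group, as in GRPO's group-relative baseline.
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5 or 1e-8
    return [(v - mean) / std for v in values]

# GRPO-style advantages depend on raw reward magnitudes:
grpo_adv = group_advantages([0.1, 0.5, 0.9, 10.0])

# Rank-based advantages (GOPO-style sketch) depend only on the ordering,
# so they are invariant to any monotone rescaling of the rewards:
gopo_adv = group_advantages(rank_transform([0.1, 0.5, 0.9, 10.0]))
```

The invariance is the point: a reward model trained only on pairwise preferences pins down the ordering of rewards but not their scale, so an advantage that ignores scale cannot be distorted by outlier reward magnitudes within a group.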
Problem

Research questions and friction points this paper is trying to address.

reinforcement learning from human feedback
reward model
policy optimization
non-verifiable rewards
relative preferences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Policy Optimization
Ranked Rewards
Reinforcement Learning from Human Feedback
Ordinal Ranking
Non-verifiable Rewards