GOPO: Policy Optimization using Ranked Rewards

📅 2026-02-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the mismatch between existing policy optimization methods and preference-based reward modeling in non-verifiable reward settings, such as summarization and instruction following, where reliance on absolute reward values limits performance. To resolve this, the authors propose Group Ordinal Policy Optimization (GOPO), which directly incorporates the ordinal ranking of rewards into policy training. By applying an ordinal transformation, GOPO aligns the objectives of preference learning and policy optimization while avoiding dependence on unreliable reward magnitudes. Experiments demonstrate that GOPO improves training stability and sample efficiency across diverse tasks and model scales: compared to GRPO, it yields higher training and validation reward trajectories, achieves higher LLM-as-judge scores at most intermediate steps, and reaches comparable policy quality in fewer training iterations.

📝 Abstract
Standard reinforcement learning from human feedback (RLHF) trains a reward model on pairwise preference data and then uses it for policy optimization. However, while reward models are optimized to capture relative preferences, existing policy optimization techniques rely on absolute reward magnitudes during training. In settings where the rewards are non-verifiable, such as summarization, instruction following, and chat completion, this misalignment often leads to suboptimal performance. We introduce Group Ordinal Policy Optimization (GOPO), a policy optimization method that uses only the ranking of the rewards and discards their magnitudes. Our rank-based transformation of rewards provides several gains over Group Relative Policy Optimization (GRPO) in settings with non-verifiable rewards: (1) consistently higher training/validation reward trajectories, (2) improved LLM-as-judge evaluations across most intermediate training steps, and (3) reaching a policy of comparable quality in substantially fewer training steps than GRPO. We demonstrate consistent improvements across a range of tasks and model sizes.
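The core idea contrasted here can be sketched as follows: GRPO standardizes raw reward magnitudes within a group of sampled responses, whereas a rank-based approach first replaces each reward with its rank, so only the ordering matters. This is a minimal illustrative sketch, not the paper's exact algorithm; the specific ordinal transformation GOPO uses may differ.

```python
def rank_transform(rewards):
    # Replace each reward with its rank in the group (0 = lowest),
    # discarding the reward magnitudes entirely.
    order = sorted(range(len(rewards)), key=lambda i: rewards[i])
    ranks = [0.0] * len(rewards)
    for rank, i in enumerate(order):
        ranks[i] = float(rank)
    return ranks

def group_advantages(values):
    # Standardize within the group, as in GRPO's group-relative baseline.
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5 or 1e-8
    return [(v - mean) / std for v in values]

# GRPO-style advantages depend on raw reward magnitudes:
grpo_adv = group_advantages([0.1, 0.5, 0.9, 10.0])

# Rank-based advantages (GOPO-style sketch) depend only on the ordering,
# so they are invariant to any monotone rescaling of the rewards:
gopo_adv = group_advantages(rank_transform([0.1, 0.5, 0.9, 10.0]))
```

The invariance is the point: a reward model trained only on pairwise preferences pins down the ordering of rewards but not their scale, so an advantage that ignores scale cannot be distorted by outlier reward magnitudes within a group.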
Problem

Research questions and friction points this paper is trying to address.

reinforcement learning from human feedback
reward model
policy optimization
non-verifiable rewards
relative preferences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Policy Optimization
Ranked Rewards
Reinforcement Learning from Human Feedback
Ordinal Ranking
Non-verifiable Rewards