Quantile Reward Policy Optimization: Alignment with Pointwise Regression and Exact Partition Functions

📅 2025-07-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing offline alignment methods learn from preference pairs or relative signals (e.g., DPO, REBEL), while methods that use pointwise absolute rewards require online policy sampling (e.g., PPO, GRPO), limiting their applicability and efficiency. To address these limitations, we propose Quantile Reward Policy Optimization (QRPO), an offline alignment framework that learns directly from **pointwise absolute rewards**. QRPO converts rewards into quantile rewards, for which the partition function of the KL-regularized RL objective has a closed-form solution, eliminating the dependence on relative preference signals and on online sampling. The resulting target admits a simple regression loss, and quantile rewards can be precomputed, with estimates improving as more compute is spent on them. On chat and coding evaluations including AlpacaEval 2 and LeetCode, QRPO outperforms DPO, REBEL, and SimPO on 8B-parameter models, and training on rewards rather than preferences induces less length bias.

📝 Abstract
Aligning large language models with pointwise absolute rewards has so far required online, on-policy algorithms such as PPO and GRPO. In contrast, simpler methods that can leverage offline or off-policy data, such as DPO and REBEL, are limited to learning from preference pairs or relative signals. To bridge this gap, we introduce *Quantile Reward Policy Optimization* (QRPO), which learns from pointwise absolute rewards while preserving the simplicity and offline applicability of DPO-like methods. QRPO uses quantile rewards to enable regression to the closed-form solution of the KL-regularized RL objective. This reward yields an analytically tractable partition function, removing the need for relative signals to cancel this term. Moreover, QRPO scales with increased compute to estimate quantile rewards, opening a new dimension for pre-computation scaling. Empirically, QRPO consistently achieves top performance on chat and coding evaluations -- reward model scores, AlpacaEval 2, and LeetCode -- compared to DPO, REBEL, and SimPO across diverse datasets and 8B-scale models. Finally, we find that training with robust rewards instead of converting them to preferences induces less length bias.
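The "analytically tractable partition function" claim can be made concrete. Below is a sketch of the key step, using standard KL-regularized RL notation (β for the regularization strength); this derivation is our reconstruction from the abstract, not quoted from the paper:

```latex
% The KL-regularized objective has the closed-form optimal policy
%   \pi^*(y \mid x) = \pi_{\mathrm{ref}}(y \mid x)\, e^{r(x,y)/\beta} / Z(x).
% Replacing r with its quantile q(x,y) under the reference policy's reward
% distribution makes q uniform on (0,1) for y ~ \pi_{\mathrm{ref}}, so the
% partition function is exact and prompt-independent:
\[
Z(x) \;=\; \mathbb{E}_{y \sim \pi_{\mathrm{ref}}(\cdot \mid x)}
           \big[ e^{\,q(x,y)/\beta} \big]
     \;=\; \int_0^1 e^{\,q/\beta} \, dq
     \;=\; \beta \big( e^{1/\beta} - 1 \big).
\]
```

With Z known in closed form, the target log-policy is fully computable offline, which is what lets QRPO regress to the optimal policy without relative signals cancelling the partition function.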
Problem

Research questions and friction points this paper is trying to address.

Bridging the gap between online and offline reward-learning methods
Enabling learning from pointwise absolute rewards with the simplicity of offline training
Reducing length bias by training on robust reward signals rather than preferences
Innovation

Methods, ideas, or system contributions that make the work stand out.

QRPO learns offline from pointwise absolute rewards
Quantile rewards make the partition function of the KL-regularized objective closed-form, enabling direct regression to the optimal policy
Quantile-reward estimates improve with additional pre-computation compute, opening a new scaling dimension
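The mechanism behind these contributions can be sketched in a few lines of Python. This is an illustrative reconstruction from the abstract, not the paper's reference code: the function names, the Monte Carlo quantile estimate, and the squared-error loss form are our assumptions.

```python
import math

def quantile_reward(reward, ref_rewards):
    # Empirical quantile of `reward` among the rewards of samples drawn
    # from the reference policy for the same prompt; lies in (0, 1].
    # More reference samples give a better estimate -- the
    # pre-computation scaling axis mentioned in the abstract.
    return sum(r <= reward for r in ref_rewards) / len(ref_rewards)

def qrpo_loss(logp_policy, logp_ref, q, beta=0.1):
    # For a quantile reward q ~ Uniform(0, 1) under the reference policy,
    # the KL-regularized partition function is exact:
    #   Z = E_ref[exp(q / beta)] = beta * (exp(1 / beta) - 1),
    # so the closed-form optimal policy gives the regression target
    #   log pi*(y|x) - log pi_ref(y|x) = q / beta - log Z.
    log_Z = math.log(beta) + math.log(math.expm1(1.0 / beta))
    target = q / beta - log_Z
    return (logp_policy - logp_ref - target) ** 2
```

In practice, `logp_policy` and `logp_ref` would be sequence log-probabilities from the trained and frozen reference language models, and the reference rewards would be precomputed once per prompt before training.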