General Preference Reinforcement Learning

📅 2026-05-18

🤖 AI Summary

Existing alignment methods for large language models struggle with unverifiable rewards in open-domain tasks, insufficient exploration in preference optimization, and training collapse caused by scalar reward models. To address these challenges, this work proposes General Preference Reinforcement Learning (GPRL), which builds on the General Preference Model (GPM) to embed responses into $k$ skew-symmetric subspaces. GPRL drives policy updates from all $k$ preference dimensions: it estimates a group-relative advantage per dimension, normalizes each dimension on its own scale, and aggregates the results with context-dependent eigenvalues. A closed-loop drift monitor detects single-axis exploitation and responds by reweighting preference dimensions and tightening the trust region, which mitigates reward hacking. Starting from Llama-3-8B-Instruct, GPRL achieves a 56.51% length-controlled win rate on AlpacaEval 2.0 and outperforms SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench, while suppressing reward hacking during extended training.
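To make the skew-symmetric embedding concrete, here is a minimal sketch of how a preference score could be read off two response embeddings. It assumes each of the $k$ subspaces is two-dimensional and scored with the 2-D cross product, and that the per-subspace scores are combined with eigenvalue weights and passed through a sigmoid. The function names, the sign convention, and the toy vectors are illustrative assumptions, not the paper's implementation.

```python
import numpy as np


def skew_block_scores(v_a, v_b):
    """Per-subspace score of response a over response b.

    v_a, v_b: embeddings of shape (2k,), read as k two-dimensional subspaces.
    Each subspace is scored with the 2-D cross product, i.e. a^T J b with
    J = [[0, 1], [-1, 0]] (assumed sign convention). J is skew-symmetric,
    so score(a, b) == -score(b, a) in every subspace.
    """
    a = np.asarray(v_a, dtype=float).reshape(-1, 2)
    b = np.asarray(v_b, dtype=float).reshape(-1, 2)
    return a[:, 0] * b[:, 1] - a[:, 1] * b[:, 0]


def preference_prob(v_a, v_b, eigenvalues):
    """Probability that a is preferred over b (hypothetical helper).

    eigenvalues: shape (k,), one non-negative weight per subspace. In the
    described method these depend on the prompt; here they are plain inputs.
    """
    total = float(np.dot(eigenvalues, skew_block_scores(v_a, v_b)))
    return 1.0 / (1.0 + np.exp(-total))


def unit(deg):
    """Unit vector at the given angle, used to build a toy cycle."""
    rad = np.deg2rad(deg)
    return np.array([np.cos(rad), np.sin(rad)])


# Three responses placed 120 degrees apart in a single subspace (k = 1).
y1, y2, y3 = unit(0), unit(120), unit(240)
lam = np.array([1.0])
p12 = preference_prob(y1, y2, lam)
p23 = preference_prob(y2, y3, lam)
p31 = preference_prob(y3, y1, lam)
```

In this toy setup `p12`, `p23`, and `p31` all exceed 0.5, so the three responses form a preference cycle. A single scalar reward cannot represent such a cycle, which is the sense in which the comparison is intransitivity-aware.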

📝 Abstract

Post-training has split large language model (LLM) alignment into two largely disconnected tracks. Online reinforcement learning (RL) with verifiable rewards drives emergent reasoning on math and code but depends on a programmatic verifier that cannot reach open-ended tasks, while preference optimization handles open-ended generation yet forgoes the continuous exploration that powers online RL. Closing this gap requires a verifier for open-ended quality, but a scalar reward model is the wrong shape for the job. Quality is multi-dimensional, and any scalar score is an incomplete proxy that lets online RL collapse onto whichever axis the score is most sensitive to. We turn instead to the General Preference Model (GPM), which embeds responses into $k$ skew-symmetric subspaces and represents preference as a structured, intransitivity-aware comparison. Building on this, we propose General Preference Reinforcement Learning (GPRL), which carries the $k$-way structure through to the policy update. GPRL computes per-dimension group-relative advantages, normalizes each on its own scale so no axis can dominate, and aggregates them with context-dependent eigenvalues. The same structure powers a closed-loop drift monitor that detects single-axis exploitation and corrects it on the fly by reweighting dimensions and tightening the trust region. Starting from $\texttt{Llama-3-8B-Instruct}$, GPRL reaches a length-controlled win rate of $56.51\%$ on AlpacaEval 2.0 and outperforms SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench, while resisting reward hacking across extended training runs.
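The advantage computation described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions: `scores` holds one per-dimension preference score for each of the sampled responses to a single prompt, each dimension is standardized within the group, and the eigenvalues are positive and normalized to sum to one before weighting. The function names and the toy numbers are hypothetical, and the paper's exact normalization and aggregation may differ.

```python
import numpy as np


def per_dimension_advantages(scores, eps=1e-8):
    """Group-relative advantage for each preference dimension.

    scores: shape (G, k), G sampled responses to one prompt, k dimensions.
    Each column is centered and scaled by its own group statistics, so a
    dimension with a large raw scale cannot dominate the others.
    """
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean(axis=0, keepdims=True)
    std = scores.std(axis=0, keepdims=True)
    return (scores - mean) / (std + eps)


def aggregate_advantages(advantages, eigenvalues):
    """Combine per-dimension advantages into one advantage per response.

    eigenvalues: shape (k,), assumed positive. Normalizing them to sum to
    one is an assumption made here for readability.
    """
    w = np.asarray(eigenvalues, dtype=float)
    w = w / w.sum()
    return advantages @ w


# Toy group of 4 responses scored on 2 dimensions with very different scales.
scores = np.array([
    [100.0, 0.1],
    [200.0, 0.3],
    [300.0, 0.2],
    [400.0, 0.4],
])
adv = per_dimension_advantages(scores)
total = aggregate_advantages(adv, eigenvalues=[1.0, 1.0])
```

Because each column is standardized separately, rescaling one dimension's raw scores leaves `adv` essentially unchanged. That is the property the abstract refers to when it says no axis can dominate.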

Problem

Research questions and friction points this paper is trying to address.

Preference Reinforcement Learning

Reward Hacking

Multi-dimensional Quality

Open-ended Generation

Large Language Model Alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

General Preference Model

Preference Reinforcement Learning

Multi-dimensional Reward