🤖 AI Summary
Current large language model (LLM) alignment methods rely on frozen, standalone reward models, resulting in complex pipelines, high computational overhead, and a performance ceiling imposed by static reward signals. To address these limitations, we propose URPO (Unified Reward & Policy Optimization), a framework that trains instruction following and reward modeling jointly within a single LLM. URPO uses Group Relative Policy Optimization (GRPO) so that generation and evaluation co-evolve. It unifies the "player" (policy) and "referee" (reward) roles by casting preference data, verifiable reasoning traces, and open-ended instructions into one consistent generative format for end-to-end joint training. Evaluated on Qwen2.5-7B, URPO outperforms a strong baseline that uses a separate generative reward model: it reaches 44.84 on AlpacaEval (up from 42.24), a composite reasoning average of 35.66 (up from 32.66), and 85.15 on RewardBench, surpassing the dedicated reward model it replaces (83.55).
📝 Abstract
Large-scale alignment pipelines typically pair a policy model with a separately trained reward model whose parameters remain frozen during reinforcement learning (RL). This separation creates a complex, resource-intensive pipeline and suffers from a performance ceiling due to a static reward signal. We propose a novel framework, Unified Reward & Policy Optimization (URPO), that unifies instruction-following ("player") and reward modeling ("referee") within a single model and a single training phase. Our method recasts all alignment data, including preference pairs, verifiable reasoning, and open-ended instructions, into a unified generative format optimized by a single Group Relative Policy Optimization (GRPO) loop. This enables the model to learn from ground-truth preferences and verifiable logic while simultaneously generating its own rewards for open-ended tasks. Experiments on the Qwen2.5-7B model show that URPO outperforms a strong baseline using a separate generative reward model, raising the instruction-following score on AlpacaEval from 42.24 to 44.84 and the composite reasoning average from 32.66 to 35.66. Furthermore, URPO cultivates a superior internal evaluator as a byproduct of training, achieving a RewardBench score of 85.15 and surpassing the dedicated reward model it replaces (83.55). By eliminating the need for a separate reward model and fostering a co-evolutionary dynamic between generation and evaluation, URPO presents a simpler, more efficient, and more effective path towards robustly aligned language models.
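The abstract does not spell out the GRPO update, so the following is only a minimal sketch of the group-relative advantage step as it is commonly described: several responses are sampled for one prompt, each receives a scalar reward, and every reward is normalized against the mean and standard deviation of its own group. The function name `group_relative_advantages`, the `eps` stabilizer, the use of the population standard deviation, and the example scores are all assumptions made for illustration, not details taken from URPO.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the statistics of its own sampling group.

    Assumption: population standard deviation plus a small eps; the paper's
    exact normalization may differ.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one score")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Responses above the group mean get a positive advantage, those below
    # get a negative one; eps avoids division by zero for identical rewards.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled from the same prompt.
# In a unified setup these could come from a verifier, a preference label,
# or the model's own judgment, depending on the task type.
scores = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(scores)
```

Because the advantages are centered within each group, only the relative ranking of responses to the same prompt drives the update, which is one reason rewards from different sources can plausibly share a single training loop.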