Pre-Trained Policy Discriminators are General Reward Models

📅 2025-07-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the poor generalization of reward models to arbitrary target policies. We propose POLAR, a framework that reformulates reward modeling as policy discrimination: a reward model is pre-trained to quantify the relative differences among policies, yielding transferable reward signals. Methodologically, we introduce a policy-discriminative pre-training paradigm that combines large-scale contrastive learning with relative preference optimization to train reward models at 1.8B–7B parameters, which plug directly into RLHF fine-tuning. Our key contribution is zero-shot relative-difference modeling, enabling reward models to assess arbitrary target policies without task-specific adaptation, together with strong generalization and an interpretable compute-performance power law. Experiments show preference-accuracy gains of +26.2 and +27.6 percentage points on STEM and creative-writing tasks, respectively, and markedly improved alignment for LLaMA3.1-8B and Qwen2.5-32B across 20 benchmarks.

📝 Abstract
We offer a novel perspective on reward modeling by formulating it as a policy discriminator, which quantifies the difference between two policies to generate a reward signal, guiding the training policy towards a target policy with desired behaviors. Based on this conceptual insight, we propose a scalable pre-training method named Policy Discriminative Learning (POLAR), which trains a reward model (RM) to discern identical policies and discriminate different ones. Unlike traditional reward modeling methods relying on absolute preferences, POLAR captures the relative difference between one policy and an arbitrary target policy, which is a scalable, high-level optimization objective suitable for modeling generic ranking relationships. Leveraging the POLAR pre-training paradigm, we present a series of RMs with parameter scales from 1.8B to 7B. Empirical results show that POLAR substantially outperforms traditional non-pre-trained methods, significantly enhancing RM performance. For instance, POLAR-7B could improve preference accuracy from 54.8% to 81.0% on STEM tasks and from 57.9% to 85.5% on creative writing tasks compared to SOTA baselines. POLAR also shows robust generalization capabilities in RLHF using Reinforcement Fine-tuning (RFT), providing reliable reward signals and markedly enhancing policy performance--improving LLaMa3.1-8B from an average of 47.36% to 56.33% and Qwen2.5-32B from 64.49% to 70.47% on 20 benchmarks. Moreover, scaling experiments reveal a clear power-law relationship between computation and performance, supported by linear correlation coefficients approaching 0.99. The impressive performance, strong generalization, and scaling properties suggest that POLAR is a promising direction for developing general and strong reward models.
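The abstract's core objective, training a reward model to "discern identical policies and discriminate different ones," suggests a contrastive loss: a candidate drawn from the same policy as the reference should score above candidates drawn from different policies. The sketch below is an illustrative Bradley-Terry-style formulation, not the paper's exact objective; the function name and the same-versus-different scoring setup are assumptions for illustration.

```python
import math

def polar_style_contrastive_loss(score_same, scores_diff):
    """Illustrative contrastive loss for policy discrimination (a sketch,
    not POLAR's exact objective).

    score_same  -- RM score for a candidate from the SAME policy as the
                   reference trajectory (the positive).
    scores_diff -- RM scores for candidates from DIFFERENT policies
                   (the negatives).
    Returns the mean Bradley-Terry loss -log sigmoid(margin), which
    shrinks as the positive's margin over each negative grows.
    """
    margins = [score_same - s for s in scores_diff]
    # -log sigmoid(m) == log(1 + exp(-m)); log1p keeps this numerically stable
    return sum(math.log1p(math.exp(-m)) for m in margins) / len(margins)
```

A larger margin between same-policy and different-policy scores yields a smaller loss, which is the ranking behavior the pre-training objective rewards.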
Problem

Research questions and friction points this paper is trying to address.

Reward models trained on absolute preferences generalize poorly to arbitrary target policies
Lack of a scalable pre-training objective for reward models
Unreliable reward signals limit policy improvement in RLHF
Innovation

Methods, ideas, or system contributions that make the work stand out.

Policy Discriminative Learning (POLAR) for reward modeling
Captures the relative difference between a training policy and an arbitrary target policy
Scalable pre-training with power-law computation-performance relationship
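The reported power law (linear correlation coefficients approaching 0.99 between compute and performance) can be verified by ordinary least squares in log-log space, since y = a·x^b becomes log y = log a + b·log x. A minimal sketch, assuming synthetic data rather than the paper's measurements:

```python
import math

def fit_power_law(compute, performance):
    """Fit performance ~ a * compute**b by least squares in log-log space.

    Returns (a, b, r), where r is the linear correlation coefficient in
    log-log coordinates (the statistic the paper reports as ~0.99).
    """
    xs = [math.log(c) for c in compute]
    ys = [math.log(p) for p in performance]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx                       # slope = power-law exponent
    log_a = my - b * mx                 # intercept = log of the prefactor
    r = sxy / math.sqrt(sxx * syy)      # Pearson r in log-log space
    return math.exp(log_a), b, r
```

On data generated from an exact power law, the fit recovers the exponent and yields r = 1; noisy real measurements would give r slightly below 1, as in the paper's ~0.99.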