Adaptive Margin RLHF via Preference over Preferences

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing RLHF methods typically employ fixed or score-dependent margins, overlooking variation in the intensity of human preferences and relying on precise preference scores that are hard for humans to provide reliably. To address this, the authors propose DPO-PoP, a Direct Preference Optimization variant that uses "preference over preferences" (PoP) annotations, i.e., ordinal judgments of which of two preferences reflects the stronger distinction, to infer sample-level adaptive margins without explicit human scoring. On the UltraFeedback dataset, DPO-PoP outperforms standard DPO as well as variants with fixed or ground-truth margins. The work also highlights a trade-off between discriminative accuracy and generative alignment, and introduces two PoP-label sampling strategies, one favoring each side of that trade-off.

📝 Abstract
Margin-based optimization is fundamental to improving generalization and robustness in classification tasks. In the context of reward model learning from preferences within Reinforcement Learning from Human Feedback (RLHF), existing methods typically rely on no margins, fixed margins, or margins that are simplistic functions of preference ratings. However, such formulations often fail to account for the varying strengths of different preferences (for example, some preferences warrant larger margins between responses than others), or they rely on noisy margin information derived from ratings. We argue that modeling the strength of preferences can lead to better generalization and more faithful alignment. Furthermore, many existing methods that use adaptive margins assume access to accurate preference scores, which can be difficult for humans to provide reliably. We propose an approach that leverages preferences over preferences, that is, annotations indicating which of two preferences reflects a stronger distinction. We use this ordinal signal to infer adaptive margins on a per-datapoint basis. We introduce DPO-PoP, an extension of Direct Preference Optimization (DPO) that incorporates adaptive margins from preference-over-preference supervision, enabling improved discriminative and generative performance. Empirically, our method outperforms vanilla DPO, DPO with fixed margins, and DPO with ground-truth margins on the UltraFeedback dataset. Additionally, we show that there is a tradeoff between discriminative and generative performance: improving test classification accuracy, particularly by correctly labeling weaker preferences at the expense of stronger ones, can lead to a decline in generative quality. To navigate this tradeoff, we propose two sampling strategies for gathering preference-over-preference labels: one favoring discriminative performance and one favoring generative performance.
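The paper's exact loss is not reproduced on this page, but the general idea of a margin-augmented DPO objective can be sketched in a few lines. The function below is a hypothetical illustration (the name `dpo_pop_loss` and the specific form of the margin term are assumptions, not the paper's formulation): standard DPO maximizes the sigmoid of the scaled implicit-reward gap, and a per-example margin shifts that decision boundary so that stronger preferences must be separated by a larger gap.

```python
import math

def dpo_pop_loss(logratio_chosen, logratio_rejected, margin, beta=0.1):
    """Illustrative DPO loss with a per-example adaptive margin (assumed form).

    logratio_chosen / logratio_rejected:
        log pi_theta(y|x) - log pi_ref(y|x) for the chosen and rejected
        responses, i.e. the implicit rewards up to the beta scaling.
    margin:
        per-datapoint margin, in the paper inferred from ordinal
        preference-over-preference annotations rather than explicit scores.
    """
    # Vanilla DPO is -log sigmoid(beta * (delta_w - delta_l)); subtracting a
    # margin penalizes examples unless the reward gap exceeds that margin.
    logits = beta * (logratio_chosen - logratio_rejected) - margin
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

Setting `margin=0` recovers vanilla DPO, and a larger margin on a strongly-felt preference increases the loss until the model separates the two responses by a correspondingly larger implicit-reward gap.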
Problem

Research questions and friction points this paper is trying to address.

Adaptive margin modeling for preference strength in RLHF
Leveraging ordinal preference-over-preference signals
Balancing discriminative and generative performance tradeoffs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive margins inferred from preference-over-preference annotations
Extension of DPO incorporating ordinal preference strength signals
Sampling strategies balancing discriminative and generative performance
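To make the "ordinal signal to adaptive margin" step above concrete, here is one simple way such an inference could work; this is a sketch under stated assumptions (the helper `margins_from_pop`, the win-count ranking, and the linear margin spacing are all hypothetical, not the paper's method). Each PoP comparison says one preference pair shows a clearer distinction than another; ranking pairs by how often they win such comparisons yields an ordering from which margins can be assigned.

```python
def margins_from_pop(num_pairs, pop_comparisons, max_margin=1.0):
    """Hypothetical sketch: turn preference-over-preference comparisons
    into per-pair margins by ranking win counts.

    pop_comparisons: list of (stronger_idx, weaker_idx) tuples, each
    stating that preference pair `stronger_idx` reflects a stronger
    distinction than pair `weaker_idx`.
    """
    wins = [0] * num_pairs
    for stronger, _weaker in pop_comparisons:
        wins[stronger] += 1
    # Order pairs from weakest to strongest preference by win count,
    # then spread margins linearly over [0, max_margin].
    order = sorted(range(num_pairs), key=lambda i: wins[i])
    margins = [0.0] * num_pairs
    for rank, idx in enumerate(order):
        margins[idx] = max_margin * rank / max(num_pairs - 1, 1)
    return margins
```

The key property this sketch preserves is the one the abstract emphasizes: only ordinal judgments between preferences are needed, never absolute preference scores.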