Preference Optimization via Contrastive Divergence: Your Reward Model is Secretly an NLL Estimator

📅 2025-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing preference optimization (PO) methods construct negative samples with heuristic strategies that lack theoretical grounding. This paper reformulates PO as minimizing the negative log-likelihood (NLL) of a probability model and estimates the model's normalization constant via sampling, showing that these estimative samples can serve as dispreferred completions. Building on this insight, the authors adopt contrastive divergence (CD) as the sampling strategy and propose MC-PO, which applies CD's Monte Carlo (MC) kernel to sample hard negatives with respect to the parameterized reward model, along with OnMC-PO, an extension to the online setting. On popular alignment benchmarks, MC-PO outperforms state-of-the-art baselines, and OnMC-PO yields further improvement.

📝 Abstract
Existing studies on preference optimization (PO) have centered on constructing pairwise preference data following simple heuristics, such as maximizing the margin between preferred and dispreferred completions based on human (or AI) ranked scores. However, none of these heuristics has a full theoretical justification. In this work, we develop a novel PO framework that provides theoretical guidance to effectively sample dispreferred completions. To achieve this, we formulate PO as minimizing the negative log-likelihood (NLL) of a probability model and propose to estimate its normalization constant via a sampling strategy. As we will demonstrate, these estimative samples can act as dispreferred completions in PO. We then select contrastive divergence (CD) as the sampling strategy, and propose a novel MC-PO algorithm that applies the Monte Carlo (MC) kernel from CD to sample hard negatives w.r.t. the parameterized reward model. Finally, we propose the OnMC-PO algorithm, an extension of MC-PO to the online setting. On popular alignment benchmarks, MC-PO outperforms existing SOTA baselines, and OnMC-PO leads to further improvement.
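The abstract describes sampling hard negatives with respect to the parameterized reward model via a CD-style Monte Carlo kernel. A minimal sketch of that idea: among candidate completions, draw a dispreferred one with probability proportional to its exponentiated reward, so high-reward (hard) negatives are favored. The function name, the candidate-list interface, and the softmax-over-rewards form are illustrative assumptions, not the paper's exact kernel.

```python
import math
import random

def mc_kernel_sample_negative(candidates, reward_fn, rng=random):
    """Sample a 'hard negative' from candidate completions.

    Draws a candidate with probability proportional to exp(reward),
    so completions the current reward model scores highly are picked
    more often -- a rough stand-in for the CD-based MC kernel the
    abstract describes (illustrative sketch, not the paper's method).
    """
    scores = [reward_fn(c) for c in candidates]
    m = max(scores)                          # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    r = rng.random() * total                 # inverse-CDF sampling over weights
    acc = 0.0
    for cand, w in zip(candidates, weights):
        acc += w
        if r <= acc:
            return cand
    return candidates[-1]                    # guard against float round-off
```

With a sharply peaked reward, the highest-reward candidate is selected almost surely, which is the "hard negative" behavior the abstract attributes to the MC kernel.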
Problem

Research questions and friction points this paper is trying to address.

Negative sample construction in PO relies on heuristics with no full theoretical justification
How to sample dispreferred completions in a theoretically principled way
How to select hard negatives with respect to the parameterized reward model
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reformulates PO as minimizing the negative log-likelihood of a probability model
Uses a contrastive divergence MC kernel to sample hard negatives (MC-PO)
Extends MC-PO to the online setting (OnMC-PO)