🤖 AI Summary
Existing preference optimization (PO) methods rely on heuristic strategies for negative sample construction and lack theoretical grounding. This paper reformulates PO as minimizing the negative log-likelihood (NLL) of a reward model, making its implicit NLL objective explicit for the first time. Building on this insight, the authors propose a theoretically principled negative-example generation mechanism based on contrastive divergence (CD) and Monte Carlo (MC) kernel sampling. Leveraging this, they design two algorithms: MC-PO (batch) and OnMC-PO (online), which provide rigorous theoretical justification for hard-negative sampling and enable efficient, scalable online optimization. Experiments on popular alignment benchmarks, including HH-RLHF and UltraFeedback, show that MC-PO outperforms state-of-the-art baselines and that OnMC-PO yields further gains.
📝 Abstract
Existing studies on preference optimization (PO) have centered on constructing pairwise preference data following simple heuristics, such as maximizing the margin between preferred and dispreferred completions based on human (or AI) ranked scores. However, none of these heuristics has a full theoretical justification. In this work, we develop a novel PO framework that provides theoretical guidance for effectively sampling dispreferred completions. To achieve this, we formulate PO as minimizing the negative log-likelihood (NLL) of a probability model and propose to estimate its normalization constant via a sampling strategy. As we will demonstrate, the samples used for this estimation can act as dispreferred completions in PO. We then select contrastive divergence (CD) as the sampling strategy, and propose a novel MC-PO algorithm that applies the Monte Carlo (MC) kernel from CD to sample hard negatives w.r.t. the parameterized reward model. Finally, we propose the OnMC-PO algorithm, an extension of MC-PO to the online setting. On popular alignment benchmarks, MC-PO outperforms existing SOTA baselines, and OnMC-PO leads to further improvement.
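To make the hard-negative sampling idea concrete, here is a minimal sketch of what an MC-kernel-style negative sampler could look like. This is an illustration under assumptions, not the paper's exact algorithm: it uses a DPO-style implicit reward (`beta * (log pi_theta - log pi_ref)`) as the parameterized reward model, and draws a dispreferred completion from a softmax over candidate rewards so that higher-reward ("harder") negatives are sampled more often. The function names and the choice of implicit reward are illustrative.

```python
import math
import random

def implicit_rewards(logp_theta, logp_ref, beta=0.1):
    """DPO-style implicit reward for each candidate completion.

    logp_theta / logp_ref: per-candidate sequence log-probabilities under
    the policy and reference models. (Illustrative reward parameterization;
    the paper's reward model may differ.)
    """
    return [beta * (lt - lr) for lt, lr in zip(logp_theta, logp_ref)]

def mc_kernel_sample(rewards, preferred_idx, rng=random):
    """Sample one hard negative: softmax over candidate rewards,
    excluding the preferred completion, so high-reward candidates
    (hard negatives w.r.t. the reward model) are favored."""
    idxs = [i for i in range(len(rewards)) if i != preferred_idx]
    m = max(rewards[i] for i in idxs)  # subtract max for numerical stability
    weights = [math.exp(rewards[i] - m) for i in idxs]
    total = sum(weights)
    # Inverse-CDF sampling over the categorical distribution.
    r = rng.random() * total
    acc = 0.0
    for i, w in zip(idxs, weights):
        acc += w
        if r <= acc:
            return i
    return idxs[-1]
```

In an online variant, the candidate completions would be freshly generated from the current policy at each step, so the negative distribution tracks the evolving reward model.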