Clear Preferences Leave Traces: Reference Model-Guided Sampling for Preference Learning

📅 2025-01-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the performance bottleneck that scarce high-quality training samples create for Direct Preference Optimization (DPO), this paper proposes an adaptive sampling strategy grounded in the probability space of a reference model. The key insight is that the reference model's output probability distribution inherently reflects how discriminable each preference pair is, obviating the need for additional annotations, external models, or human intervention. By analyzing this probability space to identify highly discriminative samples and combining dynamic filtering with joint DPO optimization, the method significantly reduces data dependency while improving training efficiency. Experiments demonstrate consistent gains: +0.1 to +0.4 average improvement on MT-Bench; +0.4 to +0.98 on technical tasks including programming, mathematics, and reasoning; and better performance than full-data baselines using only 30%–50% of the original training data.

Technology Category

Application Category

📝 Abstract
Direct Preference Optimization (DPO) has emerged as a de facto approach for aligning language models with human preferences. Recent work has shown that DPO's effectiveness depends on training data quality: in particular, clear quality differences between preferred and rejected responses enhance learning performance. Current methods for identifying and obtaining such high-quality samples demand additional resources or external models. We discover that the reference model's probability space naturally detects high-quality training samples. Using this insight, we present a sampling strategy that achieves consistent improvements (+0.1 to +0.4) on MT-Bench while using less than half (30–50%) of the training data. We observe substantial improvements (+0.4 to +0.98) on technical tasks (coding, math, and reasoning) across multiple models and hyperparameter settings.
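The selection idea described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact criterion: it assumes we already have per-pair reference-model log-probabilities for the chosen and rejected responses, and uses the absolute log-probability margin as a stand-in for "preference discriminability", keeping only the most clearly separated pairs for DPO training.

```python
# Hypothetical sketch of reference-model-guided sample selection for DPO.
# Assumes precomputed reference log-probs log p_ref(y_w|x) and log p_ref(y_l|x)
# per preference pair; the margin-based ranking below is an illustration of
# the general idea, not the paper's published selection rule.

def select_discriminative(samples, keep_fraction=0.5):
    """Keep the fraction of preference pairs whose reference-model
    log-prob margin |log p_ref(y_w|x) - log p_ref(y_l|x)| is largest,
    i.e. pairs the reference model already separates clearly."""
    ranked = sorted(
        samples,
        key=lambda s: abs(s["ref_logp_chosen"] - s["ref_logp_rejected"]),
        reverse=True,
    )
    k = max(1, int(len(ranked) * keep_fraction))
    return ranked[:k]

pairs = [
    {"id": 0, "ref_logp_chosen": -12.0, "ref_logp_rejected": -12.1},  # ambiguous
    {"id": 1, "ref_logp_chosen": -8.0,  "ref_logp_rejected": -20.0},  # clear
    {"id": 2, "ref_logp_chosen": -15.0, "ref_logp_rejected": -14.5},  # ambiguous
    {"id": 3, "ref_logp_chosen": -9.0,  "ref_logp_rejected": -17.0},  # clear
]
kept = select_discriminative(pairs, keep_fraction=0.5)
print([s["id"] for s in kept])  # → [1, 3]
```

With `keep_fraction=0.3`–`0.5`, the surviving subset would then be fed to a standard DPO trainer, which matches the paper's reported setting of training on only 30–50% of the original data.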
Problem

Research questions and friction points this paper is trying to address.

Direct Preference Optimization
Language Model Training
Limited High-Quality Data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reference Model Selection
Data Optimization
Direct Preference Optimization (DPO)