Finding the Sweet Spot: Preference Data Construction for Scaling Preference Optimization

📅 2025-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
During large-scale preference dataset construction, DPO performance degrades as sample size increases, primarily due to the failure of conventional “highest–lowest reward” pairing strategies. Method: We propose a novel pairing paradigm grounded in statistical properties of the reward distribution. Theoretically and empirically, we show that the optimal rejection sample lies at μ−2σ—i.e., two standard deviations below the mean—rather than at distributional extremes. Assuming approximate normality of rewards, we partition the reward space into seven quantile-based intervals and systematically evaluate all pairwise combinations to identify high-quality preference pairs, dynamically integrated into the DPO objective. Contribution/Results: Evaluated on AlpacaEval 2 across four state-of-the-art LLMs, our method consistently improves alignment performance; gains scale robustly with increasing sample size. This work establishes the first scalable, statistically principled, and robust framework for preference data construction.

📝 Abstract
Iterative data generation and model retraining are widely used to align large language models (LLMs). This process typically involves a policy model that generates on-policy responses and a reward model that guides training data selection. Direct Preference Optimization (DPO) further enhances the process by constructing preference pairs of chosen and rejected responses. In this work, we aim to scale up the number of on-policy samples via repeated random sampling to improve alignment performance. Conventional practice selects the sample with the highest reward as chosen and the one with the lowest reward as rejected for DPO. However, our experiments reveal that this strategy leads to a decline in performance as the sample size increases. To address this, we investigate preference data construction through the lens of the underlying normal distribution of sample rewards. We categorize the reward space into seven representative points and systematically explore all 21 pairwise combinations of these points. Through evaluations on four models using AlpacaEval 2, we find that selecting the rejected response at reward position μ−2σ, rather than at the minimum reward, is crucial for optimal performance. We finally introduce a scalable preference data construction strategy that consistently enhances model performance as the sample scale increases.
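The pairing rule described in the abstract can be illustrated with a minimal sketch: given many sampled responses and their rewards, keep the highest-reward sample as chosen, but pick as rejected the sample whose reward lies nearest μ−2σ of the empirical reward distribution rather than the minimum. This is an illustrative reconstruction, not the authors' code; the function name and data layout are assumptions.

```python
import statistics

def build_preference_pair(responses, rewards):
    """Select a DPO preference pair from n sampled responses.

    chosen:   the response with the highest reward (standard practice).
    rejected: the response whose reward is closest to mu - 2*sigma of
              the empirical reward distribution, per the paper's finding
              that distributional extremes make poor rejected samples.
    """
    mu = statistics.mean(rewards)
    sigma = statistics.stdev(rewards)  # sample standard deviation
    target = mu - 2 * sigma

    chosen_idx = max(range(len(rewards)), key=lambda i: rewards[i])
    rejected_idx = min(range(len(rewards)), key=lambda i: abs(rewards[i] - target))
    return responses[chosen_idx], responses[rejected_idx]
```

Note that with a heavy lower tail the μ−2σ point can still coincide with the minimum-reward sample; the distinction matters when rewards are approximately normal, as the paper assumes.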
Problem

Research questions and friction points this paper is trying to address.

Enhance alignment performance via on-policy sample scaling.
Address performance decline in Direct Preference Optimization.
Introduce scalable preference data construction strategy.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Repeated random sampling
Seven reward categories
Optimal rejected selection
Yao Xiao
Singapore University of Technology and Design, Shanda AI Research Institute
Hai Ye
MiroMind AI; National University of Singapore
Natural Language Processing
Linyao Chen
The University of Tokyo
Hwee Tou Ng
Provost's Chair Professor of Computer Science, National University of Singapore
Natural Language Processing; Computational Linguistics
Li Bing
Shanda AI Research Institute
Xiaoli Li
Institute for Infocomm Research, A*STAR, Singapore
Roy Ka-wei Lee
Singapore University of Technology and Design