🤖 AI Summary
During large-scale preference dataset construction, DPO performance degrades as sample size increases, primarily due to the failure of conventional “highest–lowest reward” pairing strategies.
Method: We propose a novel pairing paradigm grounded in statistical properties of the reward distribution. Theoretically and empirically, we show that the optimal rejection sample lies at $\mu - 2\sigma$, i.e., two standard deviations below the mean, rather than at a distributional extreme. Assuming approximate normality of rewards, we partition the reward space into seven quantile-based points and systematically evaluate all pairwise combinations to identify high-quality preference pairs, which are then integrated into the DPO objective.
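The selection rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the chosen response is the highest-reward sample and the rejected response is the sample whose reward is closest to $\mu - 2\sigma$.

```python
import statistics

def select_preference_pair(responses, rewards):
    """Sketch of the mu - 2*sigma pairing rule (illustrative, not the
    paper's code): chosen = highest-reward sample; rejected = sample
    whose reward is nearest to two standard deviations below the mean."""
    mu = statistics.mean(rewards)
    sigma = statistics.stdev(rewards)
    target = mu - 2 * sigma
    chosen_idx = max(range(len(rewards)), key=lambda i: rewards[i])
    rejected_idx = min(range(len(rewards)), key=lambda i: abs(rewards[i] - target))
    return responses[chosen_idx], responses[rejected_idx]
```

Note that with enough samples, the rejected response is typically not the worst one; selecting a moderately bad response rather than the minimum is the key difference from the conventional highest–lowest strategy.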
Contribution/Results: Evaluated on AlpacaEval 2 across four state-of-the-art LLMs, our method consistently improves alignment performance; gains scale robustly with increasing sample size. This work establishes the first scalable, statistically principled, and robust framework for preference data construction.
📝 Abstract
Iterative data generation and model retraining are widely used to align large language models (LLMs). This process typically involves a policy model that generates on-policy responses and a reward model that guides training data selection. Direct Preference Optimization (DPO) further enhances this process by constructing preference pairs of chosen and rejected responses. In this work, we aim to *scale up* the number of on-policy samples via repeated random sampling to improve alignment performance. Conventional practice selects the sample with the highest reward as chosen and the one with the lowest as rejected for DPO. However, our experiments reveal that this strategy leads to a *decline* in performance as the sample size increases. To address this, we investigate preference data construction through the lens of the underlying normal distribution of sample rewards. We categorize the reward space into seven representative points and systematically explore all 21 ($C_7^2$) pairwise combinations. Through evaluations on four models using AlpacaEval 2, we find that selecting the rejected response at reward position $\mu - 2\sigma$, rather than at the minimum reward, is crucial for optimal performance. We finally introduce a scalable preference data construction strategy that consistently enhances model performance as the sample scale increases.
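The 21-pair search space mentioned above can be enumerated directly. In this sketch, the seven representative points are assumed to be $\mu + k\sigma$ for $k \in \{-3, \dots, 3\}$; the abstract does not specify the exact offsets, so these values are illustrative.

```python
from itertools import combinations

# Hypothetical seven representative reward positions, expressed as
# offsets k in mu + k*sigma. The paper partitions the reward space
# into seven points; the specific offsets here are an assumption.
OFFSETS = [-3, -2, -1, 0, 1, 2, 3]

def candidate_pairings():
    """Enumerate all C(7,2) = 21 (chosen, rejected) position pairs,
    taking the higher reward position as the chosen side."""
    return [(hi, lo) for lo, hi in combinations(OFFSETS, 2)]
```

Each pair defines one candidate construction strategy (e.g., chosen at $\mu + 2\sigma$, rejected at $\mu - 2\sigma$), and the paper's evaluation compares all of them across four models.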