🤖 AI Summary
This work investigates which characteristics of preference data genuinely improve reasoning when language models are aligned with existing preference optimization methods such as DPO and KTO. It is the first to disentangle and quantify two distinct sources of variation in preference pairs: generator-level delta (the capability gap between the models that produce the chosen and rejected traces) and sample-level delta (the judged quality gap within an individual pair). The study proposes a two-part strategy for constructing more effective preference datasets: amplify generator-level delta, and filter for pairs with high sample-level delta. Experiments use generator models of varying scale and family to produce preference pairs and employ LLM-as-a-judge evaluations across multiple reasoning-quality dimensions. Ablation studies show that increasing generator-level delta substantially improves out-of-domain reasoning performance, whereas filtering by sample-level delta improves data efficiency and reduces training costs.
📝 Abstract
Preference optimization methods such as DPO and KTO are widely used for aligning language models, yet little is understood about which properties of preference data drive downstream reasoning gains. We ask: what aspects of a preference pair improve a reasoning model's performance on general reasoning tasks? We investigate two distinct notions of quality delta in preference data: generator-level delta, arising from differences in capability between the models that generate the chosen and rejected reasoning traces, and sample-level delta, arising from differences in judged quality within an individual preference pair. To study generator-level delta, we vary the generator's scale and model family; to study sample-level delta, we employ an LLM-as-a-judge to rate the quality of generated traces along multiple reasoning-quality dimensions. We find that increasing generator-level delta steadily improves performance on out-of-domain reasoning tasks, and that filtering data by sample-level delta enables more data-efficient training. Our results suggest a twofold recipe for improving reasoning performance through preference optimization: maximize generator-level delta when constructing preference pairs, and exploit sample-level delta to select the most informative training examples.
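The filtering half of this recipe can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how judge ratings are aggregated or what selection rule is used, so the sketch assumes sample-level delta is the mean judge score of the chosen trace minus that of the rejected trace, and that the top fraction of pairs by delta is kept. The names `PreferencePair`, `sample_level_delta`, `filter_by_delta`, and `keep_fraction` are hypothetical.

```python
import math
from dataclasses import dataclass
from statistics import mean


@dataclass
class PreferencePair:
    """One preference example with judge ratings (hypothetical layout)."""
    prompt: str
    chosen: str
    rejected: str
    # Judge ratings keyed by reasoning-quality dimension. The dimension
    # names are placeholders; the abstract does not enumerate them.
    chosen_scores: dict[str, float]
    rejected_scores: dict[str, float]


def sample_level_delta(pair: PreferencePair) -> float:
    """Assumed aggregation: mean chosen score minus mean rejected score."""
    return mean(pair.chosen_scores.values()) - mean(pair.rejected_scores.values())


def filter_by_delta(pairs: list[PreferencePair],
                    keep_fraction: float = 0.5) -> list[PreferencePair]:
    """Keep the fraction of pairs with the largest sample-level delta.

    Assumed selection rule: rank by delta (descending) and keep the top
    ceil(len(pairs) * keep_fraction) pairs.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    ranked = sorted(pairs, key=sample_level_delta, reverse=True)
    n_keep = math.ceil(len(ranked) * keep_fraction)
    return ranked[:n_keep]
```

The retained pairs could then be passed to any DPO- or KTO-style trainer. A threshold on the delta, rather than a top fraction, would be an equally plausible selection rule under these assumptions.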