Decomposing the Delta: What Do Models Actually Learn from Preference Pairs?

📅 2026-04-09

🤖 AI Summary
This work investigates which characteristics of preference data improve reasoning capabilities when language models are aligned with existing preference optimization methods such as DPO and KTO. It disentangles and quantifies two distinct sources of variation in preference pairs: generator-level disparity and sample-level disparity. The study proposes a two-part strategy for constructing more effective preference datasets: amplify generator-level disparity, and filter for high sample-level disparity. Experiments use generator models of varying scale and family to produce preference pairs, and an LLM-as-a-judge to rate the traces along multiple reasoning-quality dimensions. The results indicate that increasing generator-level disparity steadily improves out-of-domain reasoning performance, whereas filtering by sample-level disparity makes training more data-efficient.

📝 Abstract
Preference optimization methods such as DPO and KTO are widely used for aligning language models, yet little is understood about what properties of preference data drive downstream reasoning gains. We ask: what aspects of a preference pair improve a reasoning model's performance on general reasoning tasks? We investigate two distinct notions of quality delta in preference data: generator-level delta, arising from the difference in capability between the models that generate the chosen and rejected reasoning traces, and sample-level delta, arising from the difference in judged quality within an individual preference pair. To study generator-level delta, we vary the generator's scale and model family, and to study sample-level delta, we employ an LLM-as-a-judge to rate the quality of generated traces along multiple reasoning-quality dimensions. We find that increasing generator-level delta steadily improves performance on out-of-domain reasoning tasks, and that filtering data by sample-level delta can enable more data-efficient training. Our results suggest a twofold recipe for improving reasoning performance through preference optimization: maximize generator-level delta when constructing preference pairs and exploit sample-level delta to select the most informative training examples.
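
The sample-level filtering step described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the paper's implementation: the `PreferencePair` structure, the per-dimension judge scores, the use of a mean over dimensions as the aggregate, and the `keep_fraction` cutoff are all hypothetical.

```python
import math
from dataclasses import dataclass
from statistics import mean


@dataclass
class PreferencePair:
    """Hypothetical container for one preference pair and its judge ratings."""
    prompt: str
    chosen: str
    rejected: str
    # Judge scores keyed by reasoning-quality dimension (names are illustrative).
    chosen_scores: dict[str, float]
    rejected_scores: dict[str, float]


def sample_delta(pair: PreferencePair) -> float:
    """Sample-level delta, assumed here to be the gap between the mean judge
    score of the chosen trace and the mean judge score of the rejected trace."""
    return mean(pair.chosen_scores.values()) - mean(pair.rejected_scores.values())


def filter_by_sample_delta(
    pairs: list[PreferencePair], keep_fraction: float = 0.5
) -> list[PreferencePair]:
    """Keep the fraction of pairs with the largest sample-level delta."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    # Rank pairs from largest to smallest judged quality gap.
    ranked = sorted(pairs, key=sample_delta, reverse=True)
    # Round up so that a non-empty input always keeps at least one pair.
    keep = math.ceil(len(ranked) * keep_fraction)
    return ranked[:keep]
```

Under these assumptions, the filtered list would then be passed to whichever preference optimization trainer is in use. The aggregate (mean over dimensions) and the cutoff rule (top fraction) are design choices of this sketch; a threshold on the delta itself would be an equally plausible variant.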
Problem

Research questions and friction points this paper is trying to address.

preference optimization
reasoning performance
preference pairs
generator-level delta
sample-level delta
Innovation

Methods, ideas, or system contributions that make the work stand out.

preference optimization
reasoning alignment
generator-level delta
sample-level delta
data efficiency