SP²DPO: An LLM-assisted Semantic Per-Pair DPO Generalization

📅 2026-01-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Traditional DPO methods employ a single global temperature parameter β, which struggles to accommodate the heterogeneity of semantic differences in preference pairs—such as the mixture of factual errors and stylistic preferences—and is sensitive to label noise. This work proposes SP²DPO, the first approach to introduce an instance-level temperature scheduling mechanism based on semantic disparity. By leveraging a teacher language model to offline annotate each preference pair with its category, magnitude, and confidence, SP²DPO assigns a distinct temperature parameter β_i per sample pair. This enhances both the precision and auditability of preference learning without increasing training overhead. Using a large-scale β_i artifact constructed on UltraFeedback, experiments show that two out of four student models (4B–8B scale) surpass the tuned global-β baseline in length-controlled win rates on AlpacaEval 2.0, without requiring per-model hyperparameter tuning.
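The offline scheduling step described above can be sketched as a simple mapping from the teacher's structured annotation (category, magnitude, confidence) to a per-pair temperature. The mapping, scale values, and function name below are illustrative assumptions, not the paper's actual rule:

```python
def beta_from_annotation(category, magnitude, confidence, base_beta=0.1):
    """Hypothetical offline schedule: map a teacher annotation to beta_i.

    Objective failure categories (safety, factuality) get a higher
    temperature, i.e. stronger preference fitting; subjective/style pairs
    a lower one; the result is scaled down when the teacher's confidence
    in the label is low. All values here are illustrative placeholders.
    """
    category_scale = {
        "safety": 2.0,
        "factuality": 2.0,
        "instruction": 1.5,
        "style": 0.5,
    }.get(category, 1.0)  # unknown categories fall back to the base scale
    return base_beta * category_scale * magnitude * confidence
```

Because the schedule runs once, offline, over the whole corpus, the resulting β_i values can be stored alongside the preference pairs and audited independently of training.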

📝 Abstract
Direct Preference Optimization (DPO) controls the trade-off between fitting preference labels and staying close to a reference model with a single global temperature β, implicitly treating all preference pairs as equally informative. Real-world preference corpora are heterogeneous: they mix high-signal, objective failures (for example, safety, factuality, or instruction violations) with low-signal or subjective distinctions (for example, style), and also contain label noise. We introduce SP²DPO (Semantic Per-Pair DPO), a generalization that replaces the global temperature with an instance-specific schedule β_i, decided offline from structured semantic-gap annotations (category, magnitude, confidence) produced by teacher language models. We instantiate this procedure on the UltraFeedback preference corpus (59,960 pairs), enabling large-scale construction of an auditable β_i artifact at zero training-time overhead: the inner-loop optimizer remains standard DPO with β set per pair. We focus our empirical study on AlpacaEval 2.0, reporting both raw and length-controlled win rates. Across four open-weight, instruction-tuned student backbones (4B–8B), SP²DPO is competitive with a tuned global-β DPO baseline and improves AlpacaEval 2.0 length-controlled win rate on two of four backbones, while avoiding per-model β sweeps. All code, annotations, and artifacts will be released.
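Since the inner loop is standard DPO with the temperature substituted per pair, the objective reduces to the usual pairwise logistic loss with a per-sample β_i. A minimal sketch over precomputed log-ratios, assuming our own function and variable names rather than the paper's implementation:

```python
import math

def per_pair_dpo_loss(chosen_logratios, rejected_logratios, betas):
    """Mean DPO loss with an instance-specific temperature per pair.

    chosen_logratios[i]   = log pi_theta(y_w|x) - log pi_ref(y_w|x)
    rejected_logratios[i] = log pi_theta(y_l|x) - log pi_ref(y_l|x)
    betas[i]              = per-pair temperature decided offline.
    """
    losses = []
    for cw, cl, beta in zip(chosen_logratios, rejected_logratios, betas):
        margin = beta * (cw - cl)
        # -log sigmoid(margin), written in a numerically stable form
        if margin >= 0:
            losses.append(math.log1p(math.exp(-margin)))
        else:
            losses.append(-margin + math.log1p(math.exp(margin)))
    return sum(losses) / len(losses)
```

With betas all set to one constant, this reduces exactly to the global-β DPO baseline, which is why the approach adds no training-time overhead: only the per-pair scalar changes.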
Problem

Research questions and friction points this paper is trying to address.

Direct Preference Optimization
preference heterogeneity
label noise
semantic gap
temperature scheduling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic Per-Pair DPO
instance-specific temperature
preference optimization
teacher-annotated semantic gap
length-controlled evaluation