🤖 AI Summary
Static KL regularization in Direct Preference Optimization (DPO) often causes excessive policy deviation from the reference model, while existing dynamic KL methods lack fine-grained, preference-pair-level adaptivity. Method: We propose $\varepsilon$-DPO, the first DPO variant that enables instance-level dynamic tuning of the KL coefficient $\beta$. Leveraging the monotonicity of logits as a preference model, $\varepsilon$-DPO derives a lightweight, parameter-free estimate of $\beta$ for each preference pair by analyzing the sensitivity of the current and reference logits to perturbations of $\beta$, requiring no auxiliary networks or additional training overhead, and it integrates seamlessly into the standard DPO framework. Contribution/Results: On multiple general-purpose chat benchmarks, $\varepsilon$-DPO consistently outperforms diverse DPO variants and KL-relaxation methods. Empirical results demonstrate that instance-level KL adaptivity is critical for improving alignment performance, validating both the efficacy and practicality of the approach.
📝 Abstract
Direct Preference Optimization (DPO) demonstrates the advantage of aligning a large language model with human preference using only an offline dataset. However, DPO has the limitation that the KL penalty, which prevents excessive deviation from the reference model, is static throughout the training process. Several methods try to turn this static KL penalty into a dynamic one, but no approach can adaptively assign different KL penalties for each preference pair. In this paper, we propose $\varepsilon$-Direct Preference Optimization ($\varepsilon$-DPO), which allows adaptive control of the KL penalty strength $\beta$ for each preference pair. Specifically, $\varepsilon$-DPO adaptively controls $\beta$ for each preference pair based on the monotonicity of the logits as a preference model under perturbations of $\beta$ during training, simply by reusing the logits of the current policy and the reference policy. Experimental results show that $\varepsilon$-DPO outperforms existing direct alignment algorithms and KL penalty relaxation methods on general chatbot benchmarks, highlighting the significance of instance-level adaptive KL penalty relaxation in DPO.
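To make the role of a per-pair $\beta$ concrete, here is a minimal sketch of the DPO objective where $\beta$ may differ across preference pairs. This is an illustrative toy implementation, not the paper's code: the function names are hypothetical, the log-probabilities are assumed to be precomputed per response, and the instance-level $\beta$ values are taken as given rather than estimated via the logit-monotonicity criterion the abstract describes.

```python
import math

def dpo_pair_loss(pi_chosen_logp, pi_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta):
    """DPO loss for a single preference pair with its own KL coefficient beta.

    Vanilla DPO uses one static beta for all pairs; an instance-level
    scheme (as in epsilon-DPO) supplies a different beta per pair.
    """
    # Implicit-reward margin: beta * [(log pi_w - log ref_w) - (log pi_l - log ref_l)]
    margin = beta * ((pi_chosen_logp - ref_chosen_logp)
                     - (pi_rejected_logp - ref_rejected_logp))
    # Negative log-sigmoid of the margin (Bradley-Terry preference likelihood)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def batch_dpo_loss(pairs, betas):
    """Average the per-pair losses, each scaled by its own beta."""
    return sum(dpo_pair_loss(*pair, beta) for pair, beta in zip(pairs, betas)) / len(pairs)
```

A larger $\beta$ penalizes deviation from the reference model more strongly on that pair, while a smaller $\beta$ lets the policy move further from the reference where the preference signal warrants it.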