🤖 AI Summary
DPO suffers from a limited capability to generate human-preferred responses and from poor robustness when aligning large language models (LLMs) with human values. To address these limitations, we propose Self-Guided DPO (SGDPO), the first method to introduce a pilot-term gradient-control mechanism into the DPO loss function, enabling theoretically interpretable, fine-grained regulation of reward updates and thereby enhancing the stability and consistency of preference learning. SGDPO extends the DPO paradigm by integrating gradient-guided design with theory-driven loss reconstruction. We empirically validate our approach across multiple LLMs and diverse benchmarks, including HellaSwag and Anthropic-HH. Experimental results demonstrate that SGDPO improves over baseline methods by up to 9.19% on mainstream preference evaluation benchmarks. Crucially, the theoretical analysis aligns closely with the empirical findings, confirming significant gains in generalization and robustness.
📝 Abstract
Direct Preference Optimization (DPO) is widely used for aligning Large Language Models (LLMs) with human values because of its flexibility. Despite its effectiveness, it has been observed that DPO's capability to generate human-preferred responses is limited and that its results are far from robust. To address these limitations, in this paper we propose a novel Self-Guided Direct Preference Optimization algorithm, i.e., SGDPO, which incorporates a pilot term to steer the gradient flow during the optimization process, allowing fine-grained control over the updates of the chosen and rejected rewards. We provide a detailed theoretical analysis of the proposed method and elucidate its operational mechanism. Furthermore, we conduct comprehensive experiments on various models and benchmarks. The extensive experimental results are consistent with our theoretical analysis and confirm the effectiveness of the proposed approach (up to 9.19% higher score).
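The abstract does not reproduce the loss formula, but the mechanism it describes can be sketched roughly as follows. This is a minimal illustration assuming the pilot term is a scalar `gamma` that reweights the rejected implicit reward inside the standard DPO log-sigmoid margin; the function names, the placement of `gamma`, and its value are assumptions for illustration, not the paper's actual formulation (with `gamma = 1` the sketch reduces to plain DPO).

```python
import math

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_* are the policy's sequence log-probabilities; ref_* are the
    frozen reference model's. beta scales the implicit rewards.
    """
    r_chosen = beta * (logp_chosen - ref_chosen)     # implicit chosen reward
    r_rejected = beta * (logp_rejected - ref_rejected)  # implicit rejected reward
    return -log_sigmoid(r_chosen - r_rejected)

def sgdpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected,
               beta=0.1, gamma=1.5):
    """Hypothetical SGDPO-style loss: `gamma` stands in for the pilot term,
    reweighting the rejected reward inside the margin so the gradient pushes
    asymmetrically on chosen vs. rejected updates. gamma=1 recovers DPO.
    """
    r_chosen = beta * (logp_chosen - ref_chosen)
    r_rejected = beta * (logp_rejected - ref_rejected)
    return -log_sigmoid(r_chosen - gamma * r_rejected)
```

In this sketch, taking `gamma > 1` penalizes a rising rejected reward more strongly, which is one plausible way a pilot term could exert fine-grained control over the two reward updates; the paper's actual construction may differ.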