🤖 AI Summary
DPO suffers from a limited capability to generate human-preferred responses and from poor robustness when aligning large language models (LLMs) with human values. To address these limitations, we propose Self-Guided DPO (SGDPO), the first method to introduce a pilot-term gradient-control mechanism into the DPO loss function, enabling theoretically interpretable, fine-grained regulation of reward updates and thereby enhancing the stability and consistency of preference learning. SGDPO extends the DPO paradigm by integrating gradient-guided design with theory-driven loss reconstruction. We empirically validate our approach across multiple LLMs and diverse benchmarks, including HellaSwag and Anthropic-HH. Experimental results demonstrate that SGDPO improves over baseline methods by up to 9.19% on mainstream preference evaluation benchmarks. Crucially, the theoretical analysis aligns closely with the empirical findings, confirming significant gains in generalization and robustness.
📝 Abstract
Direct Preference Optimization (DPO) is widely used for aligning Large Language Models (LLMs) with human values because of its flexibility. Despite its effectiveness, it has been observed that DPO's capability to generate human-preferred responses is limited and that its results are far from robust. To address these limitations, in this paper we propose a novel Self-Guided Direct Preference Optimization algorithm, i.e., SGDPO, which incorporates a pilot term to steer the gradient flow during the optimization process, allowing fine-grained control over the updates of the chosen and rejected rewards. We provide a detailed theoretical analysis of the proposed method and elucidate its operational mechanism. Furthermore, we conduct comprehensive experiments on various models and benchmarks. The extensive experimental results are consistent with our theoretical analysis and confirm the effectiveness of the proposed approach (up to 9.19% higher score).
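The abstract does not reproduce the loss formula, but the mechanism it describes can be sketched roughly as follows. This is a minimal illustration assuming the pilot term is a scalar `gamma` that reweights the rejected implicit reward inside the standard DPO log-sigmoid margin; the function names, the placement of `gamma`, and its value are assumptions for illustration, not the paper's actual formulation (with `gamma = 1` the sketch reduces to plain DPO).

```python
import math

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_* are the policy's sequence log-probabilities; ref_* are the
    frozen reference model's. beta scales the implicit rewards.
    """
    r_chosen = beta * (logp_chosen - ref_chosen)     # implicit chosen reward
    r_rejected = beta * (logp_rejected - ref_rejected)  # implicit rejected reward
    return -log_sigmoid(r_chosen - r_rejected)

def sgdpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected,
               beta=0.1, gamma=1.5):
    """Hypothetical SGDPO-style loss: `gamma` stands in for the pilot term,
    reweighting the rejected reward inside the margin so the gradient pushes
    asymmetrically on chosen vs. rejected updates. gamma=1 recovers DPO.
    """
    r_chosen = beta * (logp_chosen - ref_chosen)
    r_rejected = beta * (logp_rejected - ref_rejected)
    return -log_sigmoid(r_chosen - gamma * r_rejected)
```

In this sketch, taking `gamma > 1` penalizes a rising rejected reward more strongly, which is one plausible way a pilot term could exert fine-grained control over the two reward updates; the paper's actual construction may differ.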