SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety

📅 2025-05-26
📈 Citations: 1
Influential: 0
🤖 AI Summary
Reinforcement Learning from Human Feedback (RLHF) for aligning large language models (LLMs) with safety requirements suffers from multi-stage training complexity, reliance on auxiliary reward and cost models, and high computational overhead. Method: This paper proposes SafeDPO, a single-stage direct preference optimization framework that integrates safety constraints directly into the DPO objective. It introduces only one additional hyperparameter and eliminates the need for reinforcement learning, separate safety classifiers, and online sampling during fine-tuning. By reformulating the DPO loss to explicitly model safety preferences and implicitly incorporate cost-based safety constraints, SafeDPO jointly optimizes for safety and human preference alignment. Contribution/Results: SafeDPO achieves performance competitive with state-of-the-art safety alignment algorithms across multiple preference and safety benchmarks, substantially suppressing harmful responses while preserving DPO's training stability and efficiency.
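The summary above does not give SafeDPO's exact objective, but the general shape it describes (a DPO-style pairwise loss, one extra safety hyperparameter, no reward/cost models) can be sketched. In this hypothetical illustration, `unsafe_w` flags pairs whose preferred response is unsafe, and a single hyperparameter `delta` (an assumed name, not from the paper) shifts the implicit reward margin against such pairs:

```python
import math

def safedpo_style_loss(policy_logp_w, policy_logp_l,
                       ref_logp_w, ref_logp_l,
                       unsafe_w, beta=0.1, delta=1.0):
    """Illustrative DPO-style loss for one preference pair.

    policy_logp_* / ref_logp_* are summed log-probabilities of the
    preferred (w) and dispreferred (l) responses under the policy and
    the frozen reference model. `unsafe_w` is 1.0 if the preferred
    response is flagged unsafe, else 0.0. This is a plausible sketch,
    NOT the paper's exact formulation.
    """
    # Standard DPO margin: beta * (implicit reward of w minus l).
    margin = beta * ((policy_logp_w - ref_logp_w)
                     - (policy_logp_l - ref_logp_l))
    # Single safety hyperparameter: shrink the margin when the
    # "chosen" response is unsafe, so the model is not pushed to
    # reinforce a harmful preference.
    margin -= delta * unsafe_w
    # -log(sigmoid(margin)), computed in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

With `delta > 0`, a pair whose preferred response is flagged unsafe incurs a strictly higher loss than the same pair unflagged, which is one simple way a single hyperparameter could trade off preference alignment against safety inside an otherwise unmodified DPO pipeline.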

📝 Abstract
As Large Language Models (LLMs) continue to advance and find applications across a growing number of fields, ensuring the safety of LLMs has become increasingly critical. To address safety concerns, recent studies have proposed integrating safety constraints into Reinforcement Learning from Human Feedback (RLHF). However, these approaches tend to be complex, as they encompass complicated procedures in RLHF along with additional steps required by the safety constraints. Inspired by Direct Preference Optimization (DPO), we introduce a new algorithm called SafeDPO, which is designed to directly optimize the safety alignment objective in a single stage of policy learning, without requiring relaxation. SafeDPO introduces only one additional hyperparameter to further enhance safety and requires only minor modifications to standard DPO. As a result, it eliminates the need to fit separate reward and cost models or to sample from the language model during fine-tuning, while still enhancing the safety of LLMs. Finally, we demonstrate that SafeDPO achieves competitive performance compared to state-of-the-art safety alignment algorithms, both in terms of aligning with human preferences and improving safety.
Problem

Research questions and friction points this paper is trying to address.

Enhancing LLM safety without complex RLHF procedures
Optimizing safety alignment in single-stage policy learning
Simplifying the integration of safety constraints into the DPO framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Single-stage safety optimization without relaxation
A single additional hyperparameter to further enhance safety
No separate reward or cost models, and no sampling during fine-tuning