Improving Safety Alignment via Balanced Direct Preference Optimization

📅 2026-03-24
🤖 AI Summary
This work addresses severe overfitting in the safety alignment of large language models. The authors attribute it to an imbalance in how well the model comprehends the positive (preferred) and negative (dispreferred) samples in preference data, an imbalance that compromises safety. They state that this is the first study to identify and analyze the problem from the perspective of how models interpret preference data. To mitigate it, they propose Balanced Direct Preference Optimization (B-DPO), which integrates mutual information metrics into the DPO framework to dynamically adjust the optimization weights of positive and negative responses. According to the reported experiments, B-DPO outperforms existing methods on multiple mainstream safety benchmarks, improving model safety without sacrificing general capabilities.
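
For reference, the standard DPO objective that B-DPO builds on can be written as follows. This is the well-known DPO loss, not the B-DPO loss, whose exact form is not given in this summary. Here $y_w$ is the preferred response, $y_l$ the dispreferred response, $\pi_{\mathrm{ref}}$ the frozen reference model, and $\beta$ a temperature hyperparameter.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Both response terms enter this loss with equal weight, which is the part that a balancing scheme such as the one described above would modify.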

📝 Abstract
With the rapid development and broad application of Large Language Models (LLMs), their potential safety risks have attracted widespread attention. Reinforcement Learning from Human Feedback (RLHF) has been adopted to enhance the safety performance of LLMs. As a simple and effective alternative to RLHF, Direct Preference Optimization (DPO) is widely used for safety alignment. However, safety alignment still suffers from severe overfitting, which limits its actual performance. This paper revisits the overfitting phenomenon from the perspective of the model's comprehension of the training data. We find that an Imbalanced Preference Comprehension phenomenon exists between the responses in preference pairs, which compromises the model's safety performance. To address this, we propose Balanced Direct Preference Optimization (B-DPO), which adaptively modulates the optimization strength between preferred and dispreferred responses based on mutual information. A series of experimental results shows that, compared to state-of-the-art methods, B-DPO enhances safety while maintaining competitive general capabilities on various mainstream benchmarks.

Warning: This paper contains examples of harmful text; reader discretion is recommended.
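
To make the idea of modulating optimization strength between the two responses concrete, here is a minimal sketch of a DPO-style loss for a single preference pair with separate weights on the preferred and dispreferred terms. It is an illustration under stated assumptions, not the authors' B-DPO: the function name `weighted_dpo_loss` and the parameters `w_chosen` and `w_rejected` are hypothetical, and the weights are plain scalars here, whereas the paper derives its modulation from mutual information in a way this page does not specify.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def weighted_dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
    w_chosen: float = 1.0,    # hypothetical weight on the preferred response
    w_rejected: float = 1.0,  # hypothetical weight on the dispreferred response
) -> float:
    """DPO-style loss for one preference pair with per-response weights.

    With w_chosen == w_rejected == 1.0 this reduces to the standard DPO loss.
    How B-DPO actually sets its weights is NOT reproduced here.
    """
    # Implicit rewards: scaled log-ratio of policy to reference model.
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    # Weighted margin between the preferred and dispreferred responses.
    margin = w_chosen * chosen_reward - w_rejected * rejected_reward
    return -log_sigmoid(margin)


if __name__ == "__main__":
    # Sequence log-probabilities below are made-up illustrative numbers.
    standard = weighted_dpo_loss(-4.0, -8.0, -5.0, -7.0)
    reweighted = weighted_dpo_loss(-4.0, -8.0, -5.0, -7.0, w_rejected=2.0)
    print(f"standard DPO loss:   {standard:.4f}")
    print(f"reweighted DPO loss: {reweighted:.4f}")
```

In a real training loop the log-probabilities would come from summing token log-probabilities under the policy and the reference model, and the weights would be recomputed per batch or per pair rather than fixed by hand.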
Problem

Research questions and friction points this paper is trying to address.

safety alignment
overfitting
preference comprehension
Large Language Models
Direct Preference Optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Balanced Direct Preference Optimization
safety alignment
preference comprehension imbalance
mutual information
overfitting mitigation
Authors

Shiji Zhao, Beihang University (Machine Learning, Trustworthy AI, Explainable AI, Robust AI)
Mengyang Wang, Institute of Artificial Intelligence, Beihang University, Beijing, China
Shukun Xiong, Institute of Artificial Intelligence, Beihang University, Beijing, China
Fangzhou Chen, Institute of Artificial Intelligence, Beihang University, Beijing, China
Qihui Zhu, Institute of Artificial Intelligence, Beihang University, Beijing, China
Shouwei Ruan, Institute of Artificial Intelligence, Beihang University, Beijing, China
Yisong Xiao, BUAA
Ranjie Duan, Alibaba Group (AI, AI Safety, AI for Common Prosperity)
Xun Chen, Institute of Artificial Intelligence, Beihang University, Beijing, China
XingXing Wei, Institute of Artificial Intelligence, Beihang University, Beijing, China