Improving Safety Alignment via Balanced Direct Preference Optimization

📅 2026-03-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a critical issue in the safety alignment of large language models: severe overfitting, which the authors trace to an imbalance in how well the model comprehends the preferred and dispreferred responses within preference pairs, and which compromises safety. The study frames this imbalance from the perspective of how models interpret preference data. To mitigate it, the authors propose Balanced Direct Preference Optimization (B-DPO), which uses mutual information within the DPO framework to adaptively adjust the optimization strength applied to preferred and dispreferred responses. Experimental results indicate that B-DPO improves safety on multiple mainstream benchmarks compared with state-of-the-art methods while keeping general capabilities competitive.

📝 Abstract

With the rapid development and widespread application of Large Language Models (LLMs), their potential safety risks have drawn considerable attention. Reinforcement Learning from Human Feedback (RLHF) has been adopted to enhance the safety performance of LLMs. As a simple and effective alternative to RLHF, Direct Preference Optimization (DPO) is widely used for safety alignment. However, safety alignment still suffers from severe overfitting, which limits its actual performance. This paper revisits the overfitting phenomenon from the perspective of the model's comprehension of the training data. We find that an Imbalanced Preference Comprehension phenomenon exists between the responses in preference pairs, which compromises the model's safety performance. To address this, we propose Balanced Direct Preference Optimization (B-DPO), which adaptively modulates optimization strength between preferred and dispreferred responses based on mutual information. Experimental results on various mainstream benchmarks show that, compared to state-of-the-art methods, B-DPO enhances safety capability while maintaining competitive general capabilities of LLMs. Warning: This paper contains examples of harmful texts, and reader discretion is recommended.
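To make the idea of modulating optimization strength concrete, here is a minimal Python sketch of the standard per-pair DPO loss with two extra scalar weights on the preferred and dispreferred log-ratio terms. This is an illustration only, not the authors' formulation: the abstract does not give the B-DPO objective or say how mutual information is turned into weights, so the names `w_chosen` and `w_rejected` and the way they enter the loss are assumptions. With both weights set to 1 the function reduces to the ordinary DPO loss.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def weighted_dpo_loss(
    pi_chosen: float,      # log-prob of the preferred response under the policy
    pi_rejected: float,    # log-prob of the dispreferred response under the policy
    ref_chosen: float,     # log-prob of the preferred response under the reference model
    ref_rejected: float,   # log-prob of the dispreferred response under the reference model
    beta: float = 0.1,     # DPO temperature
    w_chosen: float = 1.0,    # hypothetical weight on the preferred term (assumption)
    w_rejected: float = 1.0,  # hypothetical weight on the dispreferred term (assumption)
) -> float:
    """Per-pair DPO loss with illustrative weights on each response's log-ratio."""
    chosen_log_ratio = pi_chosen - ref_chosen
    rejected_log_ratio = pi_rejected - ref_rejected
    # Standard DPO uses w_chosen = w_rejected = 1.
    logits = beta * (w_chosen * chosen_log_ratio - w_rejected * rejected_log_ratio)
    return -log_sigmoid(logits)


# Policy identical to the reference: both log-ratios are 0, so the loss is log(2).
baseline = weighted_dpo_loss(-5.0, -7.0, -5.0, -7.0)

# Policy raises the preferred response and lowers the dispreferred one: loss drops.
improved = weighted_dpo_loss(-4.0, -8.0, -5.0, -7.0)

# Same policy, but the dispreferred term is weighted more heavily.
reweighted = weighted_dpo_loss(-4.0, -8.0, -5.0, -7.0, w_rejected=2.0)
```

In a method like the one described, such weights would presumably be computed per pair from a mutual-information estimate rather than fixed by hand; that estimator is not specified in this summary and is left out of the sketch.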

Problem

Research questions and friction points this paper is trying to address.

safety alignment

overfitting

preference comprehension

Large Language Models

Direct Preference Optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Balanced Direct Preference Optimization

safety alignment

preference comprehension imbalance