Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment

📅 2026-02-24

📈 Citations: 0

✨ Influential: 0

career value

203K/year

🤖 AI Summary

Large language models often fail to maintain safety alignment when confronted with indirect or deceptive jailbreak attacks due to their limited capacity for deep reasoning. This work addresses this vulnerability by introducing causal intervention analysis to expose the superficial nature of current alignment mechanisms. To overcome this limitation, the authors propose a novel reasoning-driven paradigm that constructs safety-focused fine-tuning datasets incorporating step-by-step reasoning chains. This approach integrates Chain-of-Thought fine-tuning with an Alignment-Weighted Direct Preference Optimization (DPO) algorithm, which applies distinct preference weights to the reasoning and answer components of model outputs. Empirical evaluations demonstrate that the proposed method significantly enhances robustness against diverse jailbreak attacks across multiple safety benchmarks while preserving—or even surpassing—the general performance of standard supervised fine-tuning and conventional DPO.

Technology Category

Application Category

📝 Abstract

Recent advances in alignment techniques such as Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), and Direct Preference Optimization (DPO) have improved the safety of large language models (LLMs). However, these LLMs remain vulnerable to jailbreak attacks that disguise harmful intent through indirect or deceptive phrasing. Using causal intervention, we empirically demonstrate that this vulnerability stems from shallow alignment mechanisms that lack deep reasoning, often rejecting harmful prompts without truly understanding why they are harmful. To mitigate this vulnerability, we propose enhancing alignment through reasoning-aware post-training. We construct and release a novel Chain-of-Thought (CoT) fine-tuning dataset that includes both utility-oriented and safety-critical prompts with step-by-step rationales. Fine-tuning on this dataset encourages models to produce principled refusals grounded in reasoning, outperforming standard SFT baselines. Furthermore, inspired by failure patterns in CoT fine-tuning, we introduce Alignment-Weighted DPO, which targets the most problematic parts of an output by assigning different preference weights to the reasoning and final-answer segments. This produces finer-grained, targeted updates than vanilla DPO and improves robustness to diverse jailbreak strategies. Extensive experiments across multiple safety and utility benchmarks show that our method consistently improves alignment robustness while maintaining overall model utility.

Problem

Research questions and friction points this paper is trying to address.

jailbreak attacks

safety alignment

shallow alignment

reasoning

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Alignment-Weighted DPO

Chain-of-Thought reasoning

safety alignment