Safety Alignment Should Be Made More Than Just A Few Attention Heads

📅 2025-08-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large language models (LLMs) exhibit over-reliance on a small set of critical attention heads for safety alignment, rendering them vulnerable to jailbreaking attacks via adversarial prompts. We identify that these heads collectively encode a concentrated “refusal direction” in the representation space—constituting a single-point vulnerability in the safety mechanism. To address this, we propose a two-stage approach: (1) RDSHA (Refusal Direction-based Safety Head Analysis), which identifies safety-critical attention heads by measuring their alignment with the refusal direction vector; and (2) AHD (Attention Head Diversification), a training strategy that explicitly encourages distributed representation of safety behavior across a broader set of attention heads. Experiments demonstrate that AHD significantly improves robustness against mainstream jailbreaking attacks while preserving generation quality and functional capabilities. To our knowledge, this is the first work to enable learnable and scalable distributed encoding of safety capabilities across multiple attention heads.
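The summary's core mechanism, scoring attention heads by how strongly their outputs align with a "refusal direction", can be pictured with a short sketch. This is a hypothetical illustration of the general difference-of-means / projection idea, not the authors' exact RDSHA procedure; all function names and shapes are assumptions.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit-normalized difference-of-means direction.

    A common way to extract a 'refusal direction': subtract the mean
    hidden state on harmless prompts from the mean on harmful prompts.
    """
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def score_heads(head_outputs, direction):
    """Score each head by the magnitude of its output's projection
    onto the refusal direction.

    head_outputs: (num_heads, hidden_dim), one pooled vector per head.
    """
    return np.abs(head_outputs @ direction)

# Toy illustration with random activations (16-dim hidden state).
rng = np.random.default_rng(0)
harmful = rng.normal(size=(8, 16)) + 1.0   # cluster shifted along a direction
harmless = rng.normal(size=(8, 16))
d = refusal_direction(harmful, harmless)
scores = score_heads(rng.normal(size=(4, 16)), d)
ranked = np.argsort(scores)[::-1]          # heads ranked by alignment
```

Heads with the highest scores would be the candidates for targeted ablation when probing how concentrated the safety behavior is.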

📝 Abstract
Current safety alignment for large language models (LLMs) continues to present vulnerabilities, given that adversarial prompting can effectively bypass their safety measures. Our investigation shows that these safety mechanisms predominantly depend on a limited subset of attention heads: removing or ablating these heads can severely compromise model safety. To identify and evaluate these safety-critical components, we introduce RDSHA, a targeted ablation method that leverages the model's refusal direction to pinpoint the attention heads most responsible for safety behaviors. Further analysis shows that existing jailbreak attacks exploit this concentration by selectively bypassing or manipulating these critical attention heads. To address this issue, we propose AHD, a novel training strategy designed to promote the distributed encoding of safety-related behaviors across numerous attention heads. Experimental results demonstrate that AHD successfully distributes safety-related capabilities across more attention heads. Moreover, evaluations under several mainstream jailbreak attacks show that models trained with AHD exhibit considerably stronger safety robustness while maintaining overall functional utility.
Problem

Research questions and friction points this paper is trying to address.

LLM safety alignment relies on few vulnerable attention heads
Jailbreak attacks exploit concentrated safety mechanisms in models
Distributed safety encoding needed for robust adversarial resistance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Targeted ablation method RDSHA identifies safety-critical attention heads
AHD training strategy distributes safety encoding across multiple heads
Enhanced safety robustness against jailbreak attacks while maintaining utility
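The page gives no implementation details for AHD, but its goal of distributing safety encoding across heads can be illustrated with a concentration penalty on per-head alignment scores. This is a hypothetical sketch in the spirit of the method, not the authors' actual training loss; the entropy-based penalty below is an assumption.

```python
import numpy as np

def concentration_penalty(head_scores, eps=1e-12):
    """Negative entropy of the normalized per-head alignment scores.

    If the safety signal concentrates in a few heads, the normalized
    distribution is peaked, entropy is low, and the penalty is high.
    Minimizing this term pushes safety alignment to spread across
    many heads, which is the stated aim of AHD.
    """
    p = head_scores / (head_scores.sum() + eps)
    entropy = -(p * np.log(p + eps)).sum()
    return -entropy  # lower (more negative) = more distributed

concentrated = np.array([10.0, 0.1, 0.1, 0.1])  # one dominant safety head
distributed = np.array([2.5, 2.6, 2.4, 2.5])    # safety spread across heads
assert concentration_penalty(distributed) < concentration_penalty(concentrated)
```

In an actual training loop such a term would be added to the alignment objective, trading off refusal strength against how evenly that strength is carried by the heads.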
Chao Huang
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Zefeng Zhang
Institute of Information Engineering, Chinese Academy of Sciences
Natural Language Processing
Juewei Yue
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Quangang Li
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Chuang Zhang
Tsinghua University
Autonomous Driving · Intelligent Connected Vehicle
Tingwen Liu
Institute of Information Engineering, Chinese Academy of Sciences
Content Security · Natural Language Processing · Knowledge Graph