Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning?

📅 2025-10-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large reasoning models (LRMs) suffer from safety alignment failures in multi-step reasoning: although they maintain refusal intent during the thinking phase, refusal scores sharply decline—termed the “refusal cliff”—in the final few tokens before output, leading to harmful content generation. This work formally identifies and characterizes this phenomenon for the first time. Using linear probing to track refusal intent and causal intervention analysis to localize critical inhibitory attention heads, we demonstrate that masking only 3% of attention heads significantly improves safety. Furthermore, we propose Cliff-as-a-Judge, a novel data selection method that leverages the refusal cliff as an implicit safety signal. With merely 1.7% of conventional safety training data, it reduces adversarial attack success rates to below 10%, validating a new “less-is-more” paradigm for efficient and effective safety alignment.

Technology Category

Application Category

📝 Abstract
Large reasoning models (LRMs) with multi-step reasoning capabilities have shown remarkable problem-solving abilities, yet they exhibit concerning safety vulnerabilities that remain poorly understood. In this work, we investigate why safety alignment fails in reasoning models through a mechanistic interpretability lens. Using a linear probing approach to trace refusal intentions across token positions, we discover a striking phenomenon termed as extbf{refusal cliff}: many poorly-aligned reasoning models correctly identify harmful prompts and maintain strong refusal intentions during their thinking process, but experience a sharp drop in refusal scores at the final tokens before output generation. This suggests that these models are not inherently unsafe; rather, their refusal intentions are systematically suppressed. Through causal intervention analysis, we identify a sparse set of attention heads that negatively contribute to refusal behavior. Ablating just 3% of these heads can reduce attack success rates below 10%. Building on these mechanistic insights, we propose extbf{Cliff-as-a-Judge}, a novel data selection method that identifies training examples exhibiting the largest refusal cliff to efficiently repair reasoning models' safety alignment. This approach achieves comparable safety improvements using only 1.7% of the vanilla safety training data, demonstrating a less-is-more effect in safety alignment.
Problem

Research questions and friction points this paper is trying to address.

Investigating why safety alignment fails in multi-step reasoning models
Identifying refusal cliff phenomenon where refusal intentions drop sharply
Developing efficient data selection method to repair safety alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Linear probing traces refusal intentions across tokens
Ablating negative attention heads reduces attack success
Cliff-as-a-Judge selects data for efficient safety repair
🔎 Similar Papers
No similar papers found.
Q
Qingyu Yin
Zhejiang University
C
Chak Tou Leong
Hong Kong Polytechnic University
Linyi Yang
Linyi Yang
Southern University of Science and Technology
Natural Language ProcessingMachine LearningAI for Research
Wenxuan Huang
Wenxuan Huang
CUHK & ECNU
Artificial General IntelligenceMLLMLLMAIGCModel Acceleration
W
Wenjie Li
Hong Kong Polytechnic University
Xiting Wang
Xiting Wang
Associate Professor, Renmin University of China
Explainable AIAI AlignmentVisual AnalyticsTrustworthy AIReasoning
J
Jaehong Yoon
Nanyang Technological University
Y
YunXing
Xiaohongshu Inc.
X
XingYu
Xiaohongshu Inc.
J
Jinjin Gu
INSAIT