SafeThinker: Reasoning about Risk to Deepen Safety Beyond Shallow Alignment

📅 2026-01-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes SafeThinker, a framework that integrates risk-aware reasoning throughout the entire generation process, addressing the vulnerability of current large language models to disguised adversarial attacks (e.g., prefilling) while preserving utility. SafeThinker employs a lightweight gateway classifier to assess input risk and routes each query to one of three mechanisms matched to that risk level: a Standardized Refusal Mechanism for explicit threats, a Safety-Aware Twin Expert (SATE) module that intercepts deceptive attacks masquerading as benign queries, and a Distribution-Guided Think (DDGT) component that adaptively intervenes during uncertain generation. By allocating defensive resources dynamically, SafeThinker significantly reduces attack success rates across diverse jailbreak strategies without compromising helpfulness, achieving a robust balance between safety and practicality.

📝 Abstract
Despite the intrinsic risk-awareness of Large Language Models (LLMs), current defenses often result in shallow safety alignment, rendering models vulnerable to disguised attacks (e.g., prefilling) while degrading utility. To bridge this gap, we propose SafeThinker, an adaptive framework that dynamically allocates defensive resources via a lightweight gateway classifier. Based on the gateway's risk assessment, inputs are routed through three distinct mechanisms: (i) a Standardized Refusal Mechanism for explicit threats to maximize efficiency; (ii) a Safety-Aware Twin Expert (SATE) module to intercept deceptive attacks masquerading as benign queries; and (iii) a Distribution-Guided Think (DDGT) component that adaptively intervenes during uncertain generation. Experiments show that SafeThinker significantly lowers attack success rates across diverse jailbreak strategies without compromising utility, demonstrating that coordinating intrinsic judgment throughout the generation process effectively balances robustness and practicality.
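The routing described in the abstract can be illustrated with a minimal sketch. All names, thresholds, and the keyword-based risk scorer below are assumptions for illustration only; the paper's actual gateway is a trained lightweight classifier, and SATE/DDGT are model-level mechanisms, not string lookups.

```python
# Hypothetical sketch of SafeThinker-style gateway routing.
# gateway_risk() stands in for the paper's lightweight gateway
# classifier; the thresholds and term lists are invented here.

def gateway_risk(query: str) -> float:
    """Return a risk score in [0, 1] (a real system would use a
    trained classifier, not keyword matching)."""
    q = query.lower()
    explicit = ("ignore previous instructions", "how to build a weapon")
    suspicious = ("pretend you are", "roleplay as")
    if any(t in q for t in explicit):
        return 1.0
    if any(t in q for t in suspicious):
        return 0.6
    return 0.2

def safethinker_route(query: str,
                      high: float = 0.9,
                      low: float = 0.4) -> str:
    """Dispatch a query to one of the three mechanisms named in the
    abstract, based on the gateway's risk assessment."""
    risk = gateway_risk(query)
    if risk >= high:
        # (i) Standardized Refusal Mechanism for explicit threats
        return "refuse"
    if risk >= low:
        # (ii) Safety-Aware Twin Expert (SATE) screens disguised attacks
        return "sate"
    # (iii) Distribution-Guided Think (DDGT): generate normally but
    # intervene adaptively when the output distribution is uncertain
    return "ddgt"
```

The point of the dispatch is cost allocation: cheap standardized refusals handle obvious threats, while the heavier SATE and DDGT interventions are reserved for ambiguous or low-risk inputs.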
Problem

Research questions and friction points this paper is trying to address.

safety alignment
disguised attacks
jailbreak
risk-awareness
LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

SafeThinker
risk-awareness
adaptive defense
Safety-Aware Twin Expert
Distribution-Guided Think
Authors

Xianya Fang (Nanjing University of Aeronautics and Astronautics)
Xianying Luo (Nanjing University of Aeronautics and Astronautics)
Yadong Wang (Nanjing University of Aeronautics and Astronautics)
Xiang Chen (Nanjing University of Science and Technology) — Computer Vision, Image Processing, Artificial Intelligence, Deep Learning
Yu Tian (Institute for AI, Tsinghua University)
Zequn Sun (Nanjing University) — Knowledge Graph, Large Language Model
Rui Liu (Didi International Business Group)
Jun Fang (Didi International Business Group)
Naiqiang Tan (Didi International Business Group)
Yuanning Cui (Nanjing University of Information Science and Technology) — Graph Machine Learning, Knowledge Graph, Graph Foundation Model, LLMs
Sheng-Jun Huang (Nanjing University of Aeronautics and Astronautics) — Machine Learning