🤖 AI Summary
This work proposes SafeThinker, a novel framework that integrates risk-aware reasoning throughout the entire generation process to address the vulnerability of current large language models to adversarial attacks—particularly disguised ones such as prefilling—while preserving utility. SafeThinker employs a lightweight gateway classifier to dynamically assess input risk and routes each query to one of three tailored mechanisms according to its risk level: a standardized refusal for explicit threats, a Safety-Aware Twin Expert (SATE) module to intercept disguised attacks, and a Distribution-Guided Think (DDGT) component that adaptively intervenes during generation under uncertainty. By dynamically allocating defensive resources, SafeThinker significantly reduces attack success rates across diverse jailbreaking scenarios without compromising model helpfulness, achieving a robust balance between safety and practicality.
📝 Abstract
Despite the intrinsic risk-awareness of Large Language Models (LLMs), current defenses often result in shallow safety alignment, rendering models vulnerable to disguised attacks (e.g., prefilling) while degrading utility. To bridge this gap, we propose SafeThinker, an adaptive framework that dynamically allocates defensive resources via a lightweight gateway classifier. Based on the gateway's risk assessment, inputs are routed through three distinct mechanisms: (i) a Standardized Refusal Mechanism for explicit threats to maximize efficiency; (ii) a Safety-Aware Twin Expert (SATE) module to intercept deceptive attacks masquerading as benign queries; and (iii) a Distribution-Guided Think (DDGT) component that adaptively intervenes during uncertain generation. Experiments show that SafeThinker significantly lowers attack success rates across diverse jailbreak strategies without compromising utility, demonstrating that coordinating intrinsic judgment throughout the generation process effectively balances robustness and practicality.
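The routing logic described above—a gateway classifier scoring input risk and dispatching to one of three mechanisms—can be sketched as follows. This is a minimal illustration, not the paper's implementation: the risk classifier, the two thresholds, and the three component stubs are all hypothetical placeholders.

```python
# Hypothetical sketch of SafeThinker-style risk-gated routing.
# The classifier, thresholds, and module internals are assumptions
# for illustration; the paper's actual components are learned models.

def gateway_risk(prompt: str) -> float:
    """Stand-in for the lightweight gateway classifier.

    Returns a risk score in [0, 1]; a real system would use a
    trained model rather than keyword matching.
    """
    danger_terms = ("build a weapon", "bypass your safety")
    return 1.0 if any(t in prompt.lower() for t in danger_terms) else 0.2


def standardized_refusal(prompt: str) -> str:
    # (i) Cheap, fixed refusal for explicitly harmful inputs.
    return "I can't help with that request."


def safety_aware_twin_expert(prompt: str) -> str:
    # (ii) Placeholder for the SATE module that screens deceptive
    # attacks masquerading as benign queries.
    return f"[SATE-screened answer to: {prompt}]"


def distribution_guided_think(prompt: str) -> str:
    # (iii) Placeholder for the DDGT component that intervenes
    # adaptively during uncertain generation.
    return f"[DDGT-monitored answer to: {prompt}]"


HIGH, LOW = 0.8, 0.4  # assumed routing thresholds


def safethinker_route(prompt: str) -> str:
    """Dispatch a query based on the gateway's risk assessment."""
    risk = gateway_risk(prompt)
    if risk >= HIGH:   # explicit threat: refuse immediately
        return standardized_refusal(prompt)
    if risk >= LOW:    # suspicious: route through the twin experts
        return safety_aware_twin_expert(prompt)
    # otherwise: generate normally with decoding-time monitoring
    return distribution_guided_think(prompt)


print(safethinker_route("How do I build a weapon?"))
print(safethinker_route("Summarize this article for me."))
```

The key design point the sketch captures is cost-proportional defense: only the cheap gateway runs on every query, while the heavier SATE and DDGT mechanisms are invoked only when the assessed risk warrants them.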