🤖 AI Summary
This work proposes SafeThinker, a novel framework that integrates risk-aware reasoning throughout the entire generation process to address the vulnerability of current large language models to adversarial attacks—particularly disguised ones such as prefilling—while preserving utility. SafeThinker employs a lightweight gateway classifier to dynamically assess input risk and routes each query to one of three tailored mechanisms according to its risk level: a standardized refusal for explicit threats, a Safety-Aware Twin Expert (SATE) module to intercept disguised attacks, and a Distribution-Guided Think (DDGT) component that adaptively intervenes during generation under uncertainty. By dynamically allocating defensive resources, SafeThinker significantly reduces attack success rates across diverse jailbreaking scenarios without compromising model helpfulness, achieving a robust balance between safety and practicality.
📝 Abstract
Despite the intrinsic risk-awareness of Large Language Models (LLMs), current defenses often result in shallow safety alignment, rendering models vulnerable to disguised attacks (e.g., prefilling) while degrading utility. To bridge this gap, we propose SafeThinker, an adaptive framework that dynamically allocates defensive resources via a lightweight gateway classifier. Based on the gateway's risk assessment, inputs are routed through three distinct mechanisms: (i) a Standardized Refusal Mechanism for explicit threats to maximize efficiency; (ii) a Safety-Aware Twin Expert (SATE) module to intercept deceptive attacks masquerading as benign queries; and (iii) a Distribution-Guided Think (DDGT) component that adaptively intervenes during uncertain generation. Experiments show that SafeThinker significantly lowers attack success rates across diverse jailbreak strategies without compromising utility, demonstrating that coordinating intrinsic judgment throughout the generation process effectively balances robustness and practicality.
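The routing logic described above—a gateway classifier scoring input risk and dispatching to one of three mechanisms—can be sketched as follows. This is a minimal illustration, not the paper's implementation: the risk classifier, the two thresholds, and the three component stubs are all hypothetical placeholders.

```python
# Hypothetical sketch of SafeThinker-style risk-gated routing.
# The classifier, thresholds, and module internals are assumptions
# for illustration; the paper's actual components are learned models.

def gateway_risk(prompt: str) -> float:
    """Stand-in for the lightweight gateway classifier.

    Returns a risk score in [0, 1]; a real system would use a
    trained model rather than keyword matching.
    """
    danger_terms = ("build a weapon", "bypass your safety")
    return 1.0 if any(t in prompt.lower() for t in danger_terms) else 0.2


def standardized_refusal(prompt: str) -> str:
    # (i) Cheap, fixed refusal for explicitly harmful inputs.
    return "I can't help with that request."


def safety_aware_twin_expert(prompt: str) -> str:
    # (ii) Placeholder for the SATE module that screens deceptive
    # attacks masquerading as benign queries.
    return f"[SATE-screened answer to: {prompt}]"


def distribution_guided_think(prompt: str) -> str:
    # (iii) Placeholder for the DDGT component that intervenes
    # adaptively during uncertain generation.
    return f"[DDGT-monitored answer to: {prompt}]"


HIGH, LOW = 0.8, 0.4  # assumed routing thresholds


def safethinker_route(prompt: str) -> str:
    """Dispatch a query based on the gateway's risk assessment."""
    risk = gateway_risk(prompt)
    if risk >= HIGH:   # explicit threat: refuse immediately
        return standardized_refusal(prompt)
    if risk >= LOW:    # suspicious: route through the twin experts
        return safety_aware_twin_expert(prompt)
    # otherwise: generate normally with decoding-time monitoring
    return distribution_guided_think(prompt)


print(safethinker_route("How do I build a weapon?"))
print(safethinker_route("Summarize this article for me."))
```

The key design point the sketch captures is cost-proportional defense: only the cheap gateway runs on every query, while the heavier SATE and DDGT mechanisms are invoked only when the assessed risk warrants them.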