Enhancing Model Defense Against Jailbreaks with Proactive Safety Reasoning

πŸ“… 2025-01-31
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Large language models (LLMs) are vulnerable to jailbreaking attacks, lack robustness against out-of-distribution harmful inputs, and produce uninterpretable refusal rationales. To address these issues, this paper proposes the Safety Chain-of-Thought (SCoT) defense framework. SCoT integrates safety judgment into the model’s internal reasoning process via intent-driven, chain-of-thought safety reasoning, enabling proactive identification and decomposition of potential harms in inputs. It further combines rule-aware refusal generation with an augmented refusal training dataset to achieve fine-grained harm attribution and interpretable refusals. Experiments show that SCoT significantly reduces vulnerability rates across diverse jailbreaking benchmarks, outperforming state-of-the-art defenses, while preserving the model’s original language capabilities and generalizing robustly to unseen harmful scenarios and long-tail adversarial examples.

πŸ“ Abstract
Large language models (LLMs) are vital for a wide range of applications yet remain susceptible to jailbreak threats, which could lead to the generation of inappropriate responses. Conventional defenses, such as refusal and adversarial training, often fail to cover corner cases or rare domains, leaving LLMs still vulnerable to more sophisticated attacks. We propose a novel defense strategy, Safety Chain-of-Thought (SCoT), which harnesses the enhanced *reasoning capabilities* of LLMs for proactive assessment of harmful inputs, rather than simply blocking them. SCoT augments any refusal training datasets to critically analyze the intent behind each request before generating answers. By employing proactive reasoning, SCoT enhances the generalization of LLMs across varied harmful queries and scenarios not covered in the safety alignment corpus. Additionally, it generates detailed refusals specifying the rules violated. Comparative evaluations show that SCoT significantly surpasses existing defenses, reducing vulnerability to out-of-distribution issues and adversarial manipulations while maintaining strong general capabilities.
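The abstract describes a training-data augmentation pattern: each example first reasons about the request's intent, then either answers or refuses while naming the safety rules violated. A minimal sketch of how such a chain-of-thought target might be composed is below; the function, field names, and rule catalogue are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a SCoT-style training-target builder.
# The structure (intent analysis -> safety judgment -> answer or
# rule-citing refusal) follows the paper's description; the rule
# names and wording here are invented for illustration.

SAFETY_RULES = {
    "R1": "Do not provide instructions that facilitate physical harm.",
    "R2": "Do not generate content that enables illegal activity.",
}

def build_scot_target(query: str, harmful: bool,
                      violated: list[str], answer: str) -> str:
    """Compose a target response that reasons about intent before
    answering, refusing with explicit rule attribution if harmful."""
    reasoning = f"Intent analysis: the request is '{query}'.\n"
    if harmful:
        rules = "; ".join(f"{r}: {SAFETY_RULES[r]}" for r in violated)
        return (reasoning
                + "Judgment: the underlying intent could cause harm.\n"
                + f"Refusal: I can't help with this; it violates {rules}")
    return reasoning + "Judgment: the request is benign.\nAnswer: " + answer

example = build_scot_target(
    "How do I pick a lock?", harmful=True, violated=["R2"], answer="")
print(example)
```

Targets built this way would replace the bare refusals in an existing refusal dataset, so the fine-tuned model learns to produce the intent analysis and rule attribution itself rather than a blanket denial.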
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Safety and Robustness
Explainable Refusal
Innovation

Methods, ideas, or system contributions that make the work stand out.

Safety Chain-of-Thought (SCoT)
Large Language Models
Adaptive Defense Strategies