Enhancing Model Defense Against Jailbreaks with Proactive Safety Reasoning

πŸ“… 2025-01-31
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Large language models (LLMs) are vulnerable to jailbreaking attacks, lack robustness against out-of-distribution harmful inputs, and produce uninterpretable refusal rationales. To address these issues, this paper proposes the Safety Chain-of-Thought (SCoT) defense framework. SCoT integrates safety judgment into the model’s internal reasoning process via intent-driven, chain-of-thought safety reasoning, enabling proactive identification and decomposition of potential harms in inputs. It further combines rule-aware refusal generation with an augmented refusal training dataset to achieve fine-grained harm attribution and interpretable refusals. Experiments show that SCoT significantly reduces vulnerability rates across diverse jailbreaking benchmarks, outperforming state-of-the-art defenses, while preserving the model’s original language capabilities and generalizing robustly to unseen harmful scenarios and long-tail adversarial examples.

πŸ“ Abstract
Large language models (LLMs) are vital for a wide range of applications yet remain susceptible to jailbreak threats, which could lead to the generation of inappropriate responses. Conventional defenses, such as refusal and adversarial training, often fail to cover corner cases or rare domains, leaving LLMs still vulnerable to more sophisticated attacks. We propose a novel defense strategy, Safety Chain-of-Thought (SCoT), which harnesses the enhanced *reasoning capabilities* of LLMs for proactive assessment of harmful inputs, rather than simply blocking them. SCoT augments any refusal training datasets to critically analyze the intent behind each request before generating answers. By employing proactive reasoning, SCoT enhances the generalization of LLMs across varied harmful queries and scenarios not covered in the safety alignment corpus. Additionally, it generates detailed refusals specifying the rules violated. Comparative evaluations show that SCoT significantly surpasses existing defenses, reducing vulnerability to out-of-distribution issues and adversarial manipulations while maintaining strong general capabilities.
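The abstract describes a training-data augmentation pattern: each example first reasons about the request's intent, then either answers or refuses while naming the safety rules violated. A minimal sketch of how such a chain-of-thought target might be composed is below; the function, field names, and rule catalogue are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a SCoT-style training-target builder.
# The structure (intent analysis -> safety judgment -> answer or
# rule-citing refusal) follows the paper's description; the rule
# names and wording here are invented for illustration.

SAFETY_RULES = {
    "R1": "Do not provide instructions that facilitate physical harm.",
    "R2": "Do not generate content that enables illegal activity.",
}

def build_scot_target(query: str, harmful: bool,
                      violated: list[str], answer: str) -> str:
    """Compose a target response that reasons about intent before
    answering, refusing with explicit rule attribution if harmful."""
    reasoning = f"Intent analysis: the request is '{query}'.\n"
    if harmful:
        rules = "; ".join(f"{r}: {SAFETY_RULES[r]}" for r in violated)
        return (reasoning
                + "Judgment: the underlying intent could cause harm.\n"
                + f"Refusal: I can't help with this; it violates {rules}")
    return reasoning + "Judgment: the request is benign.\nAnswer: " + answer

example = build_scot_target(
    "How do I pick a lock?", harmful=True, violated=["R2"], answer="")
print(example)
```

Targets built this way would replace the bare refusals in an existing refusal dataset, so the fine-tuned model learns to produce the intent analysis and rule attribution itself rather than a blanket denial.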
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Safety and Robustness
Explainable Refusal
Innovation

Methods, ideas, or system contributions that make the work stand out.

Safety Chain-of-Thought (SCoT)
Large Language Models
Adaptive Defense Strategies