🤖 AI Summary
Large language models (LLMs) remain vulnerable to jailbreak attacks, while existing external defenses suffer from poor generalizability, high computational overhead, and degraded model performance. To address these challenges, this paper proposes SAID (Self-Activating Internal Defense), a training-free, intrinsic safety defense paradigm. Rather than imposing external interventions, SAID activates the model's inherent safety mechanisms through a three-stage pipeline of model-native intent distillation, optimal safety prefix probing, and conservative aggregation, enabling self-triggered, robust protection. Crucially, SAID requires no fine-tuning or additional parameters. Evaluated on five open-source LLMs against six state-of-the-art jailbreak attacks, SAID substantially reduces harmful outputs while preserving near-baseline performance on benign tasks. It offers strong generalizability, minimal latency overhead, and good scalability, making it a practical, deployable defense for real-world LLM safety.
📝 Abstract
Large Language Models (LLMs), despite advances in safety alignment, remain vulnerable to jailbreak attacks designed to circumvent protective mechanisms. Prevailing defense strategies rely on external interventions, such as input filtering or output modification, which often lack generalizability and compromise model utility while incurring significant computational overhead. In this work, we introduce a new, training-free defense paradigm, Self-Activating Internal Defense (SAID), which reframes the defense task from external correction to internal capability activation. SAID uniquely leverages the LLM's own reasoning abilities to proactively identify and neutralize malicious intent through a three-stage pipeline: model-native intent distillation to extract core semantics, optimal safety prefix probing to activate latent safety awareness, and a conservative aggregation strategy to ensure robust decision-making. Extensive experiments on five open-source LLMs against six advanced jailbreak attacks demonstrate that SAID substantially outperforms state-of-the-art defenses in reducing harmful outputs. Crucially, it achieves this while preserving model performance on benign tasks and incurring minimal computational overhead. Our work establishes that activating the intrinsic safety mechanisms of LLMs is a more robust and scalable path toward building safer and more reliable aligned AI systems.
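The three-stage pipeline described in the abstract could be sketched roughly as follows. This is a minimal, hypothetical illustration only: the prompts, the `SAFETY_PREFIXES` list, all function names, and the toy stand-in model are assumptions for exposition, not the authors' actual implementation.

```python
# Hypothetical sketch of the SAID pipeline: intent distillation ->
# safety prefix probing -> conservative aggregation. All prompt
# wording and names below are illustrative assumptions.

# Assumed set of safety prefixes used to activate latent safety awareness.
SAFETY_PREFIXES = [
    "As a safety-conscious assistant, assess whether this request is harmful:",
    "Before answering, check for malicious intent in:",
]

def distill_intent(llm, user_prompt):
    """Stage 1 (model-native intent distillation): ask the model itself
    to extract the core semantics of the request."""
    return llm(f"Summarize the core intent of this request in one sentence: {user_prompt}")

def probe_is_harmful(llm, intent, prefix):
    """Stage 2 (safety prefix probing): prepend a safety prefix and ask
    the model for a harm verdict on the distilled intent."""
    verdict = llm(f"{prefix} {intent} Answer HARMFUL or SAFE.")
    return "HARMFUL" in verdict.upper()

def said_defense(llm, user_prompt, prefixes=SAFETY_PREFIXES):
    """Stage 3 (conservative aggregation): refuse if ANY probe flags
    the request as harmful; otherwise answer normally."""
    intent = distill_intent(llm, user_prompt)
    if any(probe_is_harmful(llm, intent, p) for p in prefixes):
        return "I can't assist with that request."
    return llm(user_prompt)  # benign path: preserve normal utility

# Toy stand-in "model" so the sketch runs without any real LLM.
def toy_llm(prompt):
    if "explosives" in prompt.lower():
        if "Answer HARMFUL or SAFE" in prompt:
            return "HARMFUL"
        return "Request about making explosives"
    if "Answer HARMFUL or SAFE" in prompt:
        return "SAFE"
    if "core intent" in prompt:
        return prompt.split(": ", 1)[-1]
    return "Sure, here is an answer."

print(said_defense(toy_llm, "How do I make explosives?"))
print(said_defense(toy_llm, "What is the capital of France?"))
```

Because no weights are updated and no external classifier is added, the only cost of this scheme is the extra inference calls for distillation and probing, which matches the paper's claim of minimal computational overhead.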