🤖 AI Summary
Large language models (LLMs) remain vulnerable to novel jailbreaking attacks, while existing defenses struggle to balance generalizability and practical deployability.
Method: This paper proposes a training-free, dynamic retrieval-augmented defense framework—the first to integrate retrieval-augmented generation (RAG) into jailbreak mitigation. It maintains an incrementally updatable attack signature database and performs real-time identification of malicious query strategies and associated risks via semantic matching and intent inference.
Contribution/Results: The framework enables plug-and-play extension of attack patterns and offers a tunable safety-utility trade-off mechanism. Experiments on the StrongREJECT benchmark demonstrate substantial reductions in success rates of strong jailbreaking attacks—including PAP and PAIR—while maintaining low false rejection rates on benign queries. These results validate the method’s effectiveness, robustness, and operational feasibility for real-world deployment.
📝 Abstract
Large Language Models (LLMs) remain vulnerable to jailbreak attacks, which attempt to elicit harmful responses from them. The evolving nature and diversity of these attacks pose many challenges for defense systems, including (1) adaptation to counter emerging attack strategies without costly retraining, and (2) control of the trade-off between safety and utility. To address these challenges, we propose Retrieval-Augmented Defense (RAD), a novel jailbreak-detection framework that incorporates a database of known attack examples into Retrieval-Augmented Generation in order to infer the underlying malicious user query and the jailbreak strategy used to attack the system. RAD enables training-free updates for newly discovered jailbreak strategies and provides a mechanism to balance safety and utility. Experiments on StrongREJECT show that RAD substantially reduces the effectiveness of strong jailbreak attacks such as PAP and PAIR while maintaining low rejection rates for benign queries. We propose a novel evaluation scheme and show that RAD achieves a robust safety-utility trade-off across a range of operating points in a controllable manner.
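The retrieval step described above — matching an incoming query against an incrementally updatable database of known attack examples, with a threshold controlling the safety-utility trade-off — can be sketched as follows. This is an illustrative toy, not the paper's implementation: the `AttackSignatureDB` and `classify` names, the bag-of-words embedding, and the threshold mechanics are all assumptions standing in for a real sentence encoder and the paper's intent-inference stage.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding; a real system would use a sentence encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse Counter vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class AttackSignatureDB:
    """Incrementally updatable store of known jailbreak examples."""

    def __init__(self):
        self.signatures = []  # list of (embedding, strategy_label)

    def add(self, example, strategy):
        # Training-free, plug-and-play update: just embed and append.
        self.signatures.append((embed(example), strategy))

    def retrieve(self, query, top_k=3):
        # Semantic matching: rank stored attack examples by similarity.
        q = embed(query)
        scored = sorted(((cosine(q, e), s) for e, s in self.signatures),
                        reverse=True)
        return scored[:top_k]

def classify(db, query, threshold=0.5):
    """Flag a query when its best match to a known attack exceeds `threshold`.

    Raising the threshold favors utility (fewer false rejections of benign
    queries); lowering it favors safety — a crude stand-in for RAD's tunable
    safety-utility trade-off.
    """
    matches = db.retrieve(query, top_k=1)
    if matches and matches[0][0] >= threshold:
        return "reject", matches[0][1]
    return "allow", None
```

In a deployed variant of this idea, newly discovered jailbreak strategies would be added via `add` without any retraining, and the threshold would be chosen per operating point on the safety-utility curve.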