Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks

📅 2024-05-30

🏛️ arXiv.org

📈 Citations: 21

✨ Influential: 2

🤖 AI Summary

Problem: Large language models (LLMs) are vulnerable to jailbreak attacks, which craft input prompts to elicit harmful or otherwise undesired outputs. Method: This paper proposes Defensive Prompt Patch (DPP), an interpretable suffix-based defense that hardens inputs without fine-tuning or otherwise modifying model parameters. It appends a short, human-readable suffix prompt to each user input and validates efficacy via multi-round adaptive attack evaluations (e.g., GCG, AutoDAN). Contribution/Results: To our knowledge, this is the first approach to achieve a Pareto-optimal trade-off between attack success rate (ASR) reduction and task utility preservation. On LLaMA-2-7B-Chat and Mistral-7B-Instruct, it reduces ASR by over 72% on average while incurring less than 0.8% utility degradation. The method generalizes across models without model-specific adaptation and significantly outperforms existing defenses in both robustness and efficiency.
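As a rough illustration of the suffix mechanism summarized above, the following minimal sketch appends a fixed, human-readable suffix to a user query before it reaches the model. The function name `apply_defensive_patch` and the text in `PLACEHOLDER_PATCH` are hypothetical and chosen only for this example; the paper's actual patch text and the way it is derived are not reproduced here.

```python
# Hypothetical placeholder text. This is NOT the patch from the paper.
PLACEHOLDER_PATCH = (
    "Please answer only if the request is safe and lawful. Otherwise, decline."
)


def apply_defensive_patch(user_query: str, patch: str = PLACEHOLDER_PATCH) -> str:
    """Return the user query with a defensive suffix appended.

    The model and its parameters are untouched; only the input text changes.
    """
    # Trim trailing whitespace so exactly one space separates query and patch.
    return f"{user_query.rstrip()} {patch}"


if __name__ == "__main__":
    print(apply_defensive_patch("Summarize the history of cryptography."))
```

Because the defense operates purely on the input string, the same wrapper could in principle sit in front of any chat model, which is one way to read the cross-model claim above.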

📝 Abstract

Safety, security, and compliance are essential requirements when aligning large language models (LLMs). However, many seemingly aligned LLMs are soon shown to be susceptible to jailbreak attacks. These attacks aim to circumvent the models' safety guardrails and security mechanisms by introducing jailbreak prompts into malicious queries. In response to these challenges, this paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism specifically designed to protect LLMs against such sophisticated jailbreak strategies. Unlike previous approaches, which have often compromised the utility of the model for the sake of safety, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs. Our method uses strategically designed interpretable suffix prompts that effectively thwart a wide range of standard and adaptive jailbreak techniques. Experiments on the LLAMA-2-7B-Chat and Mistral-7B-Instruct-v0.2 models demonstrate the robustness and adaptability of DPP, showing significant reductions in ASR with negligible impact on utility. Our approach not only outperforms existing defense strategies in balancing safety and functionality, but also provides a scalable and interpretable solution applicable to various LLM platforms.
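To make the ASR metric mentioned in the abstract concrete, the sketch below computes it as the fraction of model responses to attack prompts that are not refusals. The keyword-matching judge (`REFUSAL_MARKERS`, `is_refusal`) is an assumption made for illustration; the abstract does not specify how the paper decides whether an attack succeeded, and a real evaluation may use a different judge.

```python
# Assumed refusal phrases for a simple keyword-based judge (illustrative only).
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "i am sorry")


def is_refusal(response: str) -> bool:
    """Treat a response as a refusal if it contains any assumed marker."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def attack_success_rate(responses: list[str]) -> float:
    """Fraction of responses to attack prompts that were NOT refused."""
    if not responses:
        raise ValueError("need at least one response to compute ASR")
    successes = sum(1 for response in responses if not is_refusal(response))
    return successes / len(responses)


if __name__ == "__main__":
    sample = ["I'm sorry, I can't help with that.", "Sure, here is the answer."]
    print(attack_success_rate(sample))
```

A defense is then judged by how far it lowers this number on attack prompts while leaving scores on benign tasks (the utility side of the trade-off) close to the undefended baseline.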

Problem

Research questions and friction points this paper is trying to address.

Defending LLMs against jailbreak attacks effectively

Balancing safety and utility in LLM defenses

Providing interpretable and scalable jailbreak protection

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses interpretable suffix prompts for defense

Minimizes attack success rate with negligible utility loss

Scalable solution for various LLM platforms
