Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs

📅 2025-02-21

📈 Citations: 0

✨ Influential: 0

career value

215K/year

🤖 AI Summary

This work addresses three critical gaps in defending large language models (LLMs) against jailbreaking attacks: fragmented defense methodologies, unsystematic evaluation protocols, and poor out-of-distribution (OOD) generalization. To this end, we introduce the first unified, cross-style and cross-distribution evaluation framework for systematically assessing the robustness of 15 mainstream safety guardrails under diverse prompt injection attacks. Our methodology comprises standardized malicious/benign datasets, a multi-dimensional adversarial prompt benchmark, defense response consistency analysis, and principled OOD generalization metrics. Key findings reveal pervasive attack-style bias across existing guardrails; notably, several simple baseline methods surpass state-of-the-art defenses by 12–28% in accuracy under OOD conditions. This study exposes fundamental limitations in current defense evaluation practices and establishes a reproducible, scalable benchmark—grounded in empirical evidence—to advance robust alignment research.

Technology Category

Application Category

📝 Abstract

As large language models (LLMs) become integrated into everyday applications, ensuring their robustness and security is increasingly critical. In particular, LLMs can be manipulated into unsafe behaviour by prompts known as jailbreaks. The variety of jailbreak styles is growing, necessitating the use of external defences known as guardrails. While many jailbreak defences have been proposed, not all defences are able to handle new out-of-distribution attacks due to the narrow segment of jailbreaks used to align them. Moreover, the lack of systematisation around defences has created significant gaps in their practical application. In this work, we perform systematic benchmarking across 15 different defences, considering a broad swathe of malicious and benign datasets. We find that there is significant performance variation depending on the style of jailbreak a defence is subject to. Additionally, we show that based on current datasets available for evaluation, simple baselines can display competitive out-of-distribution performance compared to many state-of-the-art defences. Code is available at https://github.com/IBM/Adversarial-Prompt-Evaluation.

Problem

Research questions and friction points this paper is trying to address.

Evaluate guardrails against LLM prompt attacks

Systematically benchmark 15 different defence methods

Assess performance across diverse jailbreak styles

Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic benchmarking of 15 defences

Evaluates against diverse jailbreak styles

Compares simple baselines to state-of-art

🔎 Similar Papers

AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs