Comprehensive Assessment of Jailbreak Attacks Against LLMs

📅 2024-02-08
🏛️ arXiv.org
📈 Citations: 70
Influential: 3
🤖 AI Summary
Existing research on LLM jailbreaking attacks lacks standardized benchmarks and systematic evaluation. Method: This work introduces the first large-scale empirical evaluation framework, encompassing 17 state-of-the-art jailbreaking techniques, 8 mainstream aligned models, and 16 distinct policy-violation categories. It proposes the first structured taxonomy of jailbreak attacks and establishes a standardized evaluation protocol across models, attack methods, and violation types. Contribution/Results: Experiments reveal pervasive vulnerabilities: all evaluated models—including Llama-3—exhibit significant susceptibility, with peak attack success rates reaching 0.88; none of the eight advanced external defenses achieves complete mitigation; and every model fails catastrophically on at least one violation category. These findings expose fundamental limitations of current alignment methodologies, providing critical empirical evidence to guide the development of robust alignment strategies and effective defense mechanisms.


📝 Abstract
Jailbreak attacks aim to bypass the safeguards of LLMs. While researchers have studied different jailbreak attacks in depth, they have done so in isolation -- either with inconsistent experimental settings or by comparing only a limited range of methods. To fill this gap, we present the first large-scale measurement of diverse jailbreak attack methods. We collect 17 cutting-edge jailbreak methods, summarize their features, and establish a novel jailbreak attack taxonomy. Based on eight popular aligned LLMs and 160 questions drawn from 16 violation categories, we conduct a unified and impartial assessment of attack effectiveness, along with a comprehensive ablation study. Our extensive experimental results demonstrate that all the jailbreak attacks are highly effective against the evaluated LLMs: no model's safeguards cover all 16 violation categories, and all models face significant jailbreak risk, with even the well-aligned Llama-3 exhibiting a maximum attack success rate of 0.88. Additionally, we test the jailbreak attacks against eight advanced external defenses and find that none of them can mitigate the attacks entirely. Our study offers valuable insights for future research on jailbreak attacks and defenses, and serves as a benchmark tool for researchers and practitioners to evaluate them effectively.
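The attack success rate (ASR) cited above (e.g., 0.88 for Llama-3) is the core metric of such an evaluation: the fraction of jailbreak attempts for a given model–attack pair that elicit a policy-violating response. The paper's actual harness and judge model are not reproduced here; the sketch below uses hypothetical trial records and field names to illustrate how ASR is typically aggregated per model and attack.

```python
# Minimal ASR sketch. Each record is one trial for a (model, attack,
# category, question) combination; "is_success" would in practice come
# from a safety judge classifying the model's response. All field names
# and sample values here are hypothetical, not the paper's dataset.
trials = [
    {"model": "Llama-3", "attack": "AttackA", "category": "Cat1", "is_success": True},
    {"model": "Llama-3", "attack": "AttackA", "category": "Cat1", "is_success": False},
    {"model": "Llama-3", "attack": "AttackB", "category": "Cat2", "is_success": True},
]

def attack_success_rate(trials, model, attack):
    """ASR = successful jailbreaks / total attempts for one model-attack pair."""
    relevant = [t for t in trials if t["model"] == model and t["attack"] == attack]
    if not relevant:
        return 0.0
    return sum(t["is_success"] for t in relevant) / len(relevant)

print(attack_success_rate(trials, "Llama-3", "AttackA"))  # 0.5
```

Reporting the maximum ASR over all 17 attacks for each model, as the paper does, then amounts to taking `max(attack_success_rate(trials, m, a) for a in attacks)` per model `m`.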
Problem

Research questions and friction points this paper is trying to address.

Evaluate diverse jailbreak attacks on aligned LLMs comprehensively
Assess attack effectiveness and defense robustness systematically
Establish taxonomy and patterns for jailbreak research advancement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale evaluation of jailbreak attacks
Novel jailbreak attack taxonomy
Comprehensive measurement across aligned LLMs