AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models

📅 2024-01-17
🏛️ arXiv.org
📈 Citations: 9
Influential: 0
🤖 AI Summary
To address the lack of systematic, quantitative evaluation for jailbreak attacks against large language models (LLMs), this paper proposes the first dedicated quantitative assessment framework for jailbreak prompts. Methodologically, it moves beyond conventional binary robustness evaluation by introducing a dual 0–1 scoring system: a coarse-grained score (success in bypassing safety alignment) and a fine-grained score (harm severity, semantic stealthiness, and related criteria). The framework integrates a multi-model collaborative scoring pipeline, adversarial-prompt semantic-similarity analysis, and human-in-the-loop verification. The authors also release the first high-quality, manually curated jailbreak-prompt benchmark, covering diverse attack strategies. Experiments show that the framework significantly improves detection sensitivity, identifying high-risk jailbreak prompts missed by traditional methods, and substantially strengthens the discovery of LLM safety risks.
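To make the dual 0–1 scoring concrete, here is a minimal illustrative sketch, not the paper's actual code: the function names, sub-criteria, and weights below are hypothetical, chosen only to mirror the coarse-grained (bypass success) and fine-grained (weighted sub-criteria) dimensions described above.

```python
# Hypothetical sketch of a dual-dimensional 0-1 scoring scheme for
# jailbreak prompts. Criteria names and weights are illustrative.

def coarse_score(bypassed_safety: bool) -> float:
    """Coarse-grained score: did the prompt bypass safety alignment?"""
    return 1.0 if bypassed_safety else 0.0

def fine_score(criteria: dict[str, float], weights: dict[str, float]) -> float:
    """Fine-grained score: weighted average of per-criterion ratings in [0, 1]."""
    total_weight = sum(weights.values())
    return sum(criteria[k] * weights[k] for k in criteria) / total_weight

# Example: a prompt that bypassed alignment, rated on two sub-criteria.
coarse = coarse_score(True)  # 1.0
fine = fine_score(
    {"harm_severity": 0.8, "semantic_stealth": 0.5},
    {"harm_severity": 0.6, "semantic_stealth": 0.4},
)  # 0.8*0.6 + 0.5*0.4 = 0.68
```

In the paper's pipeline the per-criterion ratings would come from multiple judge models plus human verification; this sketch only shows how the two score types occupy the same 0–1 range while answering different questions.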

📝 Abstract
Ensuring the security of large language models (LLMs) against attacks has become increasingly urgent, with jailbreak attacks representing one of the most sophisticated threats. To deal with such risks, we introduce an innovative framework that can help evaluate the effectiveness of jailbreak attacks on LLMs. Unlike traditional binary evaluations focusing solely on the robustness of LLMs, our method assesses the effectiveness of the attacking prompts themselves. We present two distinct evaluation frameworks: a coarse-grained evaluation and a fine-grained evaluation. Each framework uses a scoring range from 0 to 1, offering unique perspectives and allowing for the assessment of attack effectiveness in different scenarios. Additionally, we develop a comprehensive ground truth dataset specifically tailored for jailbreak prompts. This dataset serves as a crucial benchmark for our current study and provides a foundational resource for future research. By comparing with traditional evaluation methods, our study shows that the current results align with baseline metrics while offering a more nuanced and fine-grained assessment. It also helps identify potentially harmful attack prompts that might appear harmless in traditional evaluations. Overall, our work establishes a solid foundation for assessing a broader range of attack prompts in the area of prompt injection.
Problem

Research questions and friction points this paper is trying to address.

Evaluate jailbreak attack effectiveness on large language models.
Develop frameworks for coarse-grained and fine-grained attack assessments.
Create a ground truth dataset for jailbreak prompt evaluation.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces a quantitative framework for evaluating jailbreak attack effectiveness.
Develops a ground truth dataset of jailbreak prompts.
Offers both coarse-grained and fine-grained evaluation frameworks.
👥 Authors
Dong Shu (Northwestern University, USA)
Mingyu Jin (Rutgers University, New Brunswick, USA)
Suiyuan Zhu
Beichen Wang (Wageningen University & Research)
Zihao Zhou (University of Liverpool, China)
Chong Zhang (University of Liverpool, UK)
Yongfeng Zhang (Rutgers University, USA)