🤖 AI Summary
Existing LLM jailbreaking evaluations suffer from hallucination-induced false positives, leading to inflated safety risk assessments. Method: This work systematically identifies this confounding issue and introduces BabyBLUE, a novel benchmark comprising a multi-evaluator verification framework (integrating rule-based checks, LLM-as-judge scoring, and executable simulation) and a dedicated dataset. It establishes the first three-dimensional verification paradigm jointly assessing semantic plausibility, instruction executability, and societal harm. Additionally, it proposes adversarial prompt rewriting analysis, a hallucination attribution classification model, and a human-in-the-loop validation protocol. Contribution/Results: Applied to mainstream LLMs, BabyBLUE reveals that 38.2% of purported "jailbreak" instances are hallucinatory; it reduces false positive rates by 52.7%, significantly improving the reliability of red-teaming evaluations and the effectiveness of downstream defenses.
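The multi-evaluator idea described above can be sketched as a small pipeline: a candidate "jailbreak" output counts as a genuine breach only if every verification dimension agrees. The following is a minimal illustrative sketch, not the paper's actual implementation; all function names, heuristics, and thresholds (`rule_based_check`, `simulate_executability`, the 0.5 harm threshold) are hypothetical stand-ins for the rule-based, simulation, and LLM-as-judge components.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    plausible: bool    # semantic plausibility (rule-based checks)
    executable: bool   # instruction executability (simulation stand-in)
    harmful: bool      # societal harm (LLM-as-judge score)

def rule_based_check(output: str) -> bool:
    # Toy rule: empty or refusal-style outputs are not plausible instructions.
    refusals = ("i cannot", "i can't", "as an ai")
    text = output.strip().lower()
    return bool(text) and not any(text.startswith(r) for r in refusals)

def simulate_executability(output: str) -> bool:
    # Stand-in for executable simulation: require concrete, stepwise content.
    steps = sum(
        line.strip().startswith(("1.", "2.", "-"))
        for line in output.splitlines()
    )
    return steps >= 2

def judge_harm(output: str, judge: Callable[[str], float],
               threshold: float = 0.5) -> bool:
    # `judge` would be an LLM-as-judge scorer returning a harm score in [0, 1].
    return judge(output) >= threshold

def is_true_jailbreak(output: str, judge: Callable[[str], float]) -> Verdict:
    # Joint verdict over the three dimensions; a hallucinated "jailbreak"
    # fails at least one of them.
    return Verdict(
        plausible=rule_based_check(output),
        executable=simulate_executability(output),
        harmful=judge_harm(output, judge),
    )

# A vague, refusal-prefixed output fails every dimension:
fake = "As an AI, here is a vague story about chemistry."
v = is_true_jailbreak(fake, judge=lambda s: 0.2)
assert not (v.plausible and v.executable and v.harmful)
```

Requiring agreement across all three dimensions is what filters out hallucinatory breaches: an output that merely looks unsafe but is not a usable, executable instruction is not counted as a true jailbreak.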
📝 Abstract
"Jailbreak"is a major safety concern of Large Language Models (LLMs), which occurs when malicious prompts lead LLMs to produce harmful outputs, raising issues about the reliability and safety of LLMs. Therefore, an effective evaluation of jailbreaks is very crucial to develop its mitigation strategies. However, our research reveals that many jailbreaks identified by current evaluations may actually be hallucinations-erroneous outputs that are mistaken for genuine safety breaches. This finding suggests that some perceived vulnerabilities might not represent actual threats, indicating a need for more precise red teaming benchmarks. To address this problem, we propose the $ extbf{B}$enchmark for reli$ extbf{AB}$ilit$ extbf{Y}$ and jail$ extbf{B}$reak ha$ extbf{L}$l$ extbf{U}$cination $ extbf{E}$valuation (BabyBLUE). BabyBLUE introduces a specialized validation framework including various evaluators to enhance existing jailbreak benchmarks, ensuring outputs are useful malicious instructions. Additionally, BabyBLUE presents a new dataset as an augmentation to the existing red teaming benchmarks, specifically addressing hallucinations in jailbreaks, aiming to evaluate the true potential of jailbroken LLM outputs to cause harm to human society.