🤖 AI Summary
Existing LLM jailbreaking evaluations suffer from hallucination-induced false positives, leading to inflated safety risk assessments. Method: This work systematically identifies this confounding issue and introduces BabyBLUE, a novel benchmark comprising a multi-evaluator verification framework (integrating rule-based checks, LLM-as-judge scoring, and executable simulation) and a dedicated dataset. It establishes the first three-dimensional verification paradigm jointly assessing semantic plausibility, instruction executability, and societal harm. Additionally, it proposes adversarial prompt rewriting analysis, a hallucination attribution classification model, and a human-in-the-loop validation protocol. Contribution/Results: Applied to mainstream LLMs, BabyBLUE reveals that 38.2% of purported "jailbreak" instances are hallucinatory; it reduces false positive rates by 52.7%, significantly improving the reliability of red-teaming evaluations and the effectiveness of downstream defenses.
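The multi-evaluator idea described above can be sketched as a small pipeline: a candidate "jailbreak" output counts as a genuine breach only if every verification dimension agrees. The following is a minimal illustrative sketch, not the paper's actual implementation; all function names, heuristics, and thresholds (`rule_based_check`, `simulate_executability`, the 0.5 harm threshold) are hypothetical stand-ins for the rule-based, simulation, and LLM-as-judge components.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    plausible: bool    # semantic plausibility (rule-based checks)
    executable: bool   # instruction executability (simulation stand-in)
    harmful: bool      # societal harm (LLM-as-judge score)

def rule_based_check(output: str) -> bool:
    # Toy rule: empty or refusal-style outputs are not plausible instructions.
    refusals = ("i cannot", "i can't", "as an ai")
    text = output.strip().lower()
    return bool(text) and not any(text.startswith(r) for r in refusals)

def simulate_executability(output: str) -> bool:
    # Stand-in for executable simulation: require concrete, stepwise content.
    steps = sum(
        line.strip().startswith(("1.", "2.", "-"))
        for line in output.splitlines()
    )
    return steps >= 2

def judge_harm(output: str, judge: Callable[[str], float],
               threshold: float = 0.5) -> bool:
    # `judge` would be an LLM-as-judge scorer returning a harm score in [0, 1].
    return judge(output) >= threshold

def is_true_jailbreak(output: str, judge: Callable[[str], float]) -> Verdict:
    # Joint verdict over the three dimensions; a hallucinated "jailbreak"
    # fails at least one of them.
    return Verdict(
        plausible=rule_based_check(output),
        executable=simulate_executability(output),
        harmful=judge_harm(output, judge),
    )

# A vague, refusal-prefixed output fails every dimension:
fake = "As an AI, here is a vague story about chemistry."
v = is_true_jailbreak(fake, judge=lambda s: 0.2)
assert not (v.plausible and v.executable and v.harmful)
```

Requiring agreement across all three dimensions is what filters out hallucinatory breaches: an output that merely looks unsafe but is not a usable, executable instruction is not counted as a true jailbreak.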
📝 Abstract
"Jailbreak"is a major safety concern of Large Language Models (LLMs), which occurs when malicious prompts lead LLMs to produce harmful outputs, raising issues about the reliability and safety of LLMs. Therefore, an effective evaluation of jailbreaks is very crucial to develop its mitigation strategies. However, our research reveals that many jailbreaks identified by current evaluations may actually be hallucinations-erroneous outputs that are mistaken for genuine safety breaches. This finding suggests that some perceived vulnerabilities might not represent actual threats, indicating a need for more precise red teaming benchmarks. To address this problem, we propose the $ extbf{B}$enchmark for reli$ extbf{AB}$ilit$ extbf{Y}$ and jail$ extbf{B}$reak ha$ extbf{L}$l$ extbf{U}$cination $ extbf{E}$valuation (BabyBLUE). BabyBLUE introduces a specialized validation framework including various evaluators to enhance existing jailbreak benchmarks, ensuring outputs are useful malicious instructions. Additionally, BabyBLUE presents a new dataset as an augmentation to the existing red teaming benchmarks, specifically addressing hallucinations in jailbreaks, aiming to evaluate the true potential of jailbroken LLM outputs to cause harm to human society.