The Great Pretender: A Stochasticity Problem in LLM Jailbreak

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the instability of attack success rate (ASR) in current jailbreak evaluations. The instability stems from randomness in both the generation and evaluation phases, and it makes reported results unreliable and hard to compare across studies. The authors systematically analyze the impact of this randomness and propose a more robust framework: CAS-eval for consistent evaluation and CAS-gen for stable attack generation, along with a new metric to quantify attack consistency. Experiments across multiple models, attack methods, and judges show that ASR can drop by up to 30 percentage points when a jailbreak prompt must succeed on more than one attempt, whereas CAS-gen helps existing attack methods recover this loss, improving the stability and reproducibility of jailbreak attacks.

📝 Abstract

"Oh-Oh, yes, I'm the great pretender. Pretending that I'm doing well. My need is such, I pretend too much..." sums up the state of jailbreak creation and evaluation. Suppose you find a method for generating adversarial attacks proposed by a reputable institution (e.g., BoN from Anthropic or Crescendo from Microsoft Research). The method does not deliver on the promise made in its paper, despite reporting top ASR scores against industry-grade LLMs. You successfully generate jailbreak prompts against your target (open) model, yet a generated prompt works against that target model with only a 50% consecutive success rate (5 out of 10 attempts), despite an 80% ASR (on paper) on the latest closed-source model (with a guardrail system)! This observation leads us to two conclusions. First, Attack Success Rate (ASR), the primary metric for LLM jailbreak benchmarking, is not a stable quantity. Second, published ASR numbers are therefore systematically inflated and incomparable across papers. We therefore ask: "Why does a successful jailbreak prompt not perform consistently well against the target model on which it was optimized?" To answer this question, we study the impact of stochasticity not only during attack evaluation but also during attack generation. Our evaluation covers several jailbreak attacks, models (of different sizes and from different providers), and judges. In addition, we propose a new metric and two new frameworks (CAS-eval and CAS-gen). Our evaluation framework, CAS-eval, shows that an attack's ASR can drop by up to 30 percentage points when a jailbreak prompt needs to succeed on more than one attempt. Thankfully, our attack generation framework, CAS-gen, improves previous jailbreak methods and helps them recover this loss of 30 percentage points!
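
To make the gap between a single-attempt ASR and a repeated-attempt requirement concrete, here is a minimal Python sketch on toy data. It is not the paper's CAS metric, whose definition is not given in this summary. The function names, the `min_successes` threshold, and the outcome table are all hypothetical and chosen only for illustration.

```python
from typing import Sequence


def single_attempt_asr(outcomes: Sequence[Sequence[bool]]) -> float:
    """Fraction of prompts whose first attempt was judged a success."""
    return sum(1 for trials in outcomes if trials[0]) / len(outcomes)


def consistent_asr(outcomes: Sequence[Sequence[bool]], min_successes: int) -> float:
    """Fraction of prompts judged a success in at least `min_successes` trials.

    Hypothetical consistency criterion, assumed for illustration only.
    """
    return sum(1 for trials in outcomes if sum(trials) >= min_successes) / len(outcomes)


# Toy data (made up): 5 prompts, 4 repeated attempts each.
# True means the judge labelled that attempt a successful jailbreak.
toy_outcomes = [
    [True, True, True, True],
    [True, False, True, False],
    [True, False, False, False],
    [True, True, False, True],
    [False, False, True, False],
]

first_try = single_attempt_asr(toy_outcomes)            # 4 of 5 prompts
strict = consistent_asr(toy_outcomes, min_successes=4)  # 1 of 5 prompts

print(f"single-attempt ASR: {first_try:.2f}")
print(f"ASR requiring 4/4 successes: {strict:.2f}")
```

On this toy table, four of five prompts succeed on their first attempt, but only one succeeds on every attempt, so the reported rate depends heavily on how many successes are required per prompt.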

Problem

Research questions and friction points this paper is trying to address.

stochasticity

jailbreak

Attack Success Rate

LLM

evaluation reliability

Innovation

Methods, ideas, or system contributions that make the work stand out.

stochasticity

jailbreak robustness

Attack Success Rate (ASR)