🤖 AI Summary
Large language models (LLMs) remain highly vulnerable to jailbreak attacks, while existing red-teaming methods suffer from limited effectiveness or scalability. To address this, we propose JBFuzz, an automated black-box fuzzing framework for discovering jailbreak prompts against LLMs. JBFuzz introduces three key components: (1) novel seed prompts, (2) a lightweight mutation engine, and (3) a lightweight yet accurate evaluator that scores responses to guide the fuzzer. Evaluated on nine mainstream LLMs, JBFuzz achieves an average jailbreak success rate of 99%, taking only about 60 seconds per attack on average and substantially outperforming state-of-the-art baselines. Our findings expose persistent alignment failures in current safety-aligned models. JBFuzz thus offers an efficient and scalable paradigm for robustness evaluation of LLMs, enabling systematic, large-scale adversarial testing without access to model internals or training data.
📝 Abstract
Large language models (LLMs) have shown great promise as language understanding and decision-making tools, and they have permeated various aspects of our everyday lives. However, their widespread availability also comes with novel risks, such as generating harmful, unethical, or offensive content via an attack called jailbreaking. Despite extensive efforts from LLM developers to align LLMs using human feedback, they remain susceptible to jailbreak attacks. To tackle this issue, researchers often employ red-teaming to understand and investigate jailbreak prompts. However, existing red-teaming approaches lack effectiveness, scalability, or both. To address these issues, we propose JBFuzz, a novel, effective, automated, and scalable red-teaming technique for jailbreaking LLMs. JBFuzz is inspired by the success of fuzzing for detecting bugs and vulnerabilities in software. We overcome three challenges related to effectiveness and scalability by devising novel seed prompts, a lightweight mutation engine, and a lightweight yet accurate evaluator for guiding the fuzzer. Combining all three solutions yields a potent fuzzer that requires only black-box access to the target LLM. We perform an extensive experimental evaluation of JBFuzz using nine popular and widely used LLMs. We find that JBFuzz successfully jailbreaks all LLMs for various harmful/unethical questions, with an average attack success rate of 99%. We also find that JBFuzz is extremely efficient: it jailbreaks a given LLM for a given question in 60 seconds on average. Our work highlights the susceptibility of state-of-the-art LLMs to jailbreak attacks even after safety alignment, and serves as a valuable red-teaming tool for LLM developers.
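To make the fuzzing workflow described in the abstract concrete, the following is a minimal Python sketch of a black-box jailbreak fuzzing loop in the spirit of JBFuzz: pick a seed prompt template, mutate it, insert the harmful question, query the target model, and let an evaluator score the response to guide the next iteration. All names here (`query_target_llm`, `mutate`, `evaluate_response`, `fuzz`, the `[QUESTION]` slot) are illustrative assumptions, not the paper's actual seed prompts, mutation engine, or evaluator.

```python
import random
from typing import List, Optional

# --- Placeholder helpers (hypothetical names, not the paper's code) ---------
def query_target_llm(prompt: str) -> str:
    """Black-box call to the target LLM; replace with a real API client."""
    raise NotImplementedError("wire up the target model's API here")

def mutate(template: str) -> str:
    """Stand-in for a lightweight mutation engine (e.g., synonym swaps)."""
    return template  # identity mutation as a placeholder

def evaluate_response(response: str) -> float:
    """Crude stand-in evaluator: score 1.0 unless the model clearly refuses."""
    refusals = ("i can't", "i cannot", "i'm sorry", "as an ai")
    return 0.0 if any(r in response.lower() for r in refusals) else 1.0

# --- Fuzzing loop sketch -----------------------------------------------------
def fuzz(seed_templates: List[str], harmful_question: str,
         max_iters: int = 100, threshold: float = 0.5) -> Optional[str]:
    """Black-box fuzzing loop: pick a seed, mutate, query, evaluate, repeat."""
    pool = list(seed_templates)  # seed prompt templates with a [QUESTION] slot
    for _ in range(max_iters):
        template = random.choice(pool)                 # select a seed or mutant
        candidate = mutate(template)                   # lightweight mutation
        prompt = candidate.replace("[QUESTION]", harmful_question)
        response = query_target_llm(prompt)            # black-box access only
        if evaluate_response(response) >= threshold:   # evaluator guides search
            return prompt                              # successful jailbreak prompt
        pool.append(candidate)                         # keep mutant for later rounds
    return None                                        # no jailbreak within budget
```

In this sketch the evaluator's score only decides success or failure; a fuller implementation would also use it to prioritize which seeds and mutants are selected in later iterations.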