The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security Beyond Binary Scoring

📅 2026-05-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of reproducible, large-scale infrastructure for systematically generating, categorizing, and evaluating jailbreak attacks against large language models. The authors construct a dataset of 114,000 jailbreak prompts, each assigned to one of 14 cybersecurity attack categories, and instruction-fine-tune category-aware generators that produce jailbreak prompts from a harmful seed without templates or gradient-based search. They further introduce OPTIMUS, a training-free, continuous evaluation metric intended to overcome the limitations of binary attack success rate (ASR). The generators reach perplexity of 24–39 (versus 40–140 for AutoDAN and AmpleGCG) and safety-filter evasion rates of 0.29–0.51 against LlamaPromptGuard-2-86M. Across the 114,000 prompts, OPTIMUS separates Weak, Moderate, and Optimal jailbreaks and provides category-level evidence that binary evaluation cannot supply.

📝 Abstract

Jailbreak attacks -- adversarial prompts that bypass LLM alignment through purely linguistic manipulation -- pose a growing operational security threat, yet the field lacks large-scale, reproducible infrastructure for generating, categorizing, and evaluating them systematically. This paper addresses that gap with three contributions. (1) Large-scale compositional jailbreak dataset. We construct 114,000 adversarial prompts by applying 912 composing strategies to 125 harmful seed prompts from JailBreakV-28K. Every prompt is assigned to one of 14 cybersecurity attack categories (e.g., malware, phishing, privilege escalation) via a six-model majority-vote pipeline, and each strategy is ranked by effectiveness per category, enabling principled strategy selection grounded in concrete adversarial objectives. (2) Automated jailbreak generation. We instruction-fine-tune category-aware LLMs on Moderate and Optimal subsets, producing models that synthesize fluent jailbreak prompts from a harmful seed at inference time -- no templates, no gradient search. Our generators achieve perplexity 24-39 versus 40-140 for AutoDAN and AmpleGCG, with safety-filter evasion rates of 0.29-0.51 Mal (LlamaPromptGuard-2-86M), enabling controllable, scalable red-teaming under realistic adversarial conditions. (3) OPTIMUS: a training-free jailbreak evaluator. OPTIMUS is a continuous metric J(S,H) that jointly captures semantic similarity between the harmful seed and the jailbreak (S) and harmfulness probability (H) via calibrated penalty functions. Unlike binary attack success rate (ASR), OPTIMUS requires no task-specific training, generalizes across evolving strategies, and exposes a stealth-optimal regime (S*=0.57, H*=0.43) that ASR misses. Experiments across 114,000 prompts confirm that OPTIMUS separates Weak, Moderate, and Optimal jailbreaks with category-level evidence binary evaluation cannot supply.
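As a rough illustration of the labeling step in contribution (1), the sketch below checks the stated dataset size (912 strategies × 125 seeds) and shows one way a six-model majority vote over category labels could be resolved. The abstract does not describe tie handling or the vote threshold, so the strict-majority rule, the function name `majority_vote`, and the example labels are assumptions made for illustration, not the authors' pipeline.

```python
from collections import Counter
from typing import Optional, Sequence

# Figures taken from the abstract.
NUM_STRATEGIES = 912
NUM_SEEDS = 125
DATASET_SIZE = NUM_STRATEGIES * NUM_SEEDS  # 912 * 125 = 114,000 prompts


def majority_vote(labels: Sequence[str]) -> Optional[str]:
    """Return the label chosen by more than half of the voters.

    Assumption: a strict majority is required; when no label clears
    that bar (for example a 3-3 split among six models), None is
    returned so the prompt can be flagged for separate handling.
    """
    if not labels:
        raise ValueError("at least one vote is required")
    top_label, top_count = Counter(labels).most_common(1)[0]
    if top_count > len(labels) / 2:
        return top_label
    return None


# Hypothetical votes from six labeling models for one prompt.
votes = ["malware", "malware", "phishing", "malware", "malware", "phishing"]
assigned = majority_vote(votes)  # "malware" (4 of 6 votes)
```

Returning `None` on a split vote, rather than picking an arbitrary winner, keeps ambiguous prompts visible; a real pipeline might instead use plurality voting or a tie-breaking model.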

Problem

Research questions and friction points this paper is trying to address.

jailbreak attacks

LLM security

adversarial prompts

systematic evaluation

large-scale infrastructure

Innovation

Methods, ideas, or system contributions that make the work stand out.

jailbreak attacks

adversarial prompting

large-scale dataset