Guarding the Guardrails: A Taxonomy-Driven Approach to Jailbreak Detection

📅 2025-10-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing LLM jailbreaking defenses are largely confined to single-turn, monolingual settings, and their taxonomies either lack comprehensiveness or overemphasize risk categories while neglecting the underlying attack techniques. To address this, we conduct red-teaming exercises to construct a multilingual, multi-turn adversarial dialogue dataset. We propose the first hierarchical jailbreaking taxonomy covering seven major attack families and fifty distinct strategies, and release the first annotated Italian multi-turn jailbreaking dataset. Methodologically, we integrate taxonomy-guided prompting, multi-turn dialogue modeling, and human-curated fine-grained annotation to build an interpretable detection framework. Experiments demonstrate that our taxonomy significantly improves detection of progressive and cross-lingual jailbreaking behaviors; taxonomy-guided prompting yields measurable performance gains. This work advances jailbreaking defense toward systematic, fine-grained, and interpretable solutions.

Technology Category

Application Category

📝 Abstract
Jailbreaking techniques pose a significant threat to the safety of Large Language Models (LLMs). Existing defenses typically focus on single-turn attacks, lack coverage across languages, and rely on limited taxonomies that either fail to capture the full diversity of attack strategies or emphasize risk categories rather than the jailbreaking techniques. To advance the understanding of the effectiveness of jailbreaking techniques, we conducted a structured red-teaming challenge. The outcome of our experiments are manifold. First, we developed a comprehensive hierarchical taxonomy of 50 jailbreak strategies, consolidating and extending prior classifications into seven broad families, including impersonation, persuasion, privilege escalation, cognitive overload, obfuscation, goal conflict, and data poisoning. Second, we analyzed the data collected from the challenge to examine the prevalence and success rates of different attack types, providing insights into how specific jailbreak strategies exploit model vulnerabilities and induce misalignment. Third, we benchmark a popular LLM for jailbreak detection, evaluating the benefits of taxonomy-guided prompting for improving automatic detection. Finally, we compiled a new Italian dataset of 1364 multi-turn adversarial dialogues, annotated with our taxonomy, enabling the study of interactions where adversarial intent emerges gradually and succeeds in bypassing traditional safeguards.
Problem

Research questions and friction points this paper is trying to address.

Developing a comprehensive taxonomy of 50 jailbreak strategies for LLMs
Analyzing prevalence and success rates of different attack types
Benchmarking detection methods and creating multilingual adversarial datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed hierarchical taxonomy of 50 jailbreak strategies
Benchmarked LLM detection with taxonomy-guided prompting approach
Compiled annotated Italian dataset of multi-turn dialogues
🔎 Similar Papers
No similar papers found.
O
Olga E. Sorokoletova
Department of Computer, Control and Management Engineering Sapienza University of Rome
F
Francesco Giarrusso
Department of Computer, Control and Management Engineering Sapienza University of Rome
Vincenzo Suriani
Vincenzo Suriani
Sapienza University of Rome
Daniele Nardi
Daniele Nardi
Sapienza Univ. Roma, Dept. Computer, Control and Management Engineering
Artificial IntelligenceRoboticsMulti Agent Systems