Guarding the Guardrails: A Taxonomy-Driven Approach to Jailbreak Detection

📅 2025-10-14

📈 Citations: 0

✨ Influential: 0

career value

168K/year

🤖 AI Summary

Existing LLM jailbreaking defenses are largely confined to single-turn, monolingual settings, and their taxonomies either lack comprehensiveness or overemphasize risk categories while neglecting the underlying attack techniques. To address this, we conduct red-teaming exercises to construct a multilingual, multi-turn adversarial dialogue dataset. We propose the first hierarchical jailbreaking taxonomy covering seven major attack families and fifty distinct strategies, and release the first annotated Italian multi-turn jailbreaking dataset. Methodologically, we integrate taxonomy-guided prompting, multi-turn dialogue modeling, and human-curated fine-grained annotation to build an interpretable detection framework. Experiments demonstrate that our taxonomy significantly improves detection of progressive and cross-lingual jailbreaking behaviors; taxonomy-guided prompting yields measurable performance gains. This work advances jailbreaking defense toward systematic, fine-grained, and interpretable solutions.

Technology Category

Application Category

📝 Abstract

Jailbreaking techniques pose a significant threat to the safety of Large Language Models (LLMs). Existing defenses typically focus on single-turn attacks, lack coverage across languages, and rely on limited taxonomies that either fail to capture the full diversity of attack strategies or emphasize risk categories rather than the jailbreaking techniques. To advance the understanding of the effectiveness of jailbreaking techniques, we conducted a structured red-teaming challenge. The outcome of our experiments are manifold. First, we developed a comprehensive hierarchical taxonomy of 50 jailbreak strategies, consolidating and extending prior classifications into seven broad families, including impersonation, persuasion, privilege escalation, cognitive overload, obfuscation, goal conflict, and data poisoning. Second, we analyzed the data collected from the challenge to examine the prevalence and success rates of different attack types, providing insights into how specific jailbreak strategies exploit model vulnerabilities and induce misalignment. Third, we benchmark a popular LLM for jailbreak detection, evaluating the benefits of taxonomy-guided prompting for improving automatic detection. Finally, we compiled a new Italian dataset of 1364 multi-turn adversarial dialogues, annotated with our taxonomy, enabling the study of interactions where adversarial intent emerges gradually and succeeds in bypassing traditional safeguards.

Problem

Research questions and friction points this paper is trying to address.

Developing a comprehensive taxonomy of 50 jailbreak strategies for LLMs

Analyzing prevalence and success rates of different attack types

Benchmarking detection methods and creating multilingual adversarial datasets

Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed hierarchical taxonomy of 50 jailbreak strategies

Benchmarked LLM detection with taxonomy-guided prompting approach

Compiled annotated Italian dataset of multi-turn dialogues

🔎 Similar Papers

JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models