XBreaking: Explainable Artificial Intelligence for Jailbreaking LLMs

📅 2025-04-30
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Current large language model (LLM) safety moderation mechanisms suffer from poor interpretability and undetectable alignment vulnerabilities. Method: We propose a contrastive attribution–based explainable AI (XAI) framework that jointly applies feature attribution (e.g., Integrated Gradients), behavioral comparison between moderated and unmoderated models, targeted adversarial noise optimization, and multi-round feedback distillation to systematically uncover latent alignment patterns within moderation logic. Contribution/Results: This work pioneers the use of XAI for jailbreak attack design, yielding the first interpretable and reproducible targeted jailbreaking framework. Evaluated on leading closed- and open-weight LLMs, it achieves a 37% improvement in attack success rate while generating human-understandable explanations of vulnerability root causes. Our approach establishes a novel paradigm for rigorous AI safety evaluation and robust alignment auditing.
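
The summary above mentions feature attribution and a behavioral comparison between moderated and unmoderated models. Purely as an illustration of the comparison step (not the paper's code), the sketch below contrasts per-layer hidden states of two checkpoints on the same prompt; the model identifiers are placeholders, and a plain hidden-state distance stands in for a full attribution method such as Integrated Gradients.

```python
# Minimal sketch of the contrastive-analysis idea (assumptions, not the paper's code).
# Assumes the censored and uncensored checkpoints share architecture and tokenizer;
# the model ids below are placeholders, not the ones evaluated in the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

CENSORED_ID = "org/aligned-chat-model"      # placeholder id (assumption)
UNCENSORED_ID = "org/uncensored-variant"    # placeholder id (assumption)

def layer_divergence(prompt: str):
    tok = AutoTokenizer.from_pretrained(CENSORED_ID)
    censored = AutoModelForCausalLM.from_pretrained(CENSORED_ID)
    uncensored = AutoModelForCausalLM.from_pretrained(UNCENSORED_ID)

    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        h_c = censored(**ids, output_hidden_states=True).hidden_states
        h_u = uncensored(**ids, output_hidden_states=True).hidden_states

    # Per-layer L2 distance between the two models' representations; layers
    # with the largest divergence are candidate carriers of alignment logic.
    return [(i, torch.dist(a, b).item()) for i, (a, b) in enumerate(zip(h_c, h_u))]

if __name__ == "__main__":
    scores = layer_divergence("Describe a request a safety filter would normally block.")
    for layer, d in sorted(scores, key=lambda t: -t[1])[:5]:
        print(f"layer {layer:2d}  divergence {d:.3f}")
```
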

📝 Abstract
Large Language Models are fundamental actors in the modern IT landscape dominated by AI solutions. However, security threats associated with them might prevent their reliable adoption in critical application scenarios such as government organizations and medical institutions. For this reason, commercial LLMs typically undergo a sophisticated censoring mechanism to eliminate any harmful output they could possibly produce. In response to this, LLM Jailbreaking is a significant threat to such protections, and many previous approaches have already demonstrated its effectiveness across diverse domains. Existing jailbreak proposals mostly adopt a generate-and-test strategy to craft malicious input. To improve the comprehension of censoring mechanisms and design a targeted jailbreak attack, we propose an Explainable-AI solution that comparatively analyzes the behavior of censored and uncensored models to derive unique exploitable alignment patterns. Then, we propose XBreaking, a novel jailbreak attack that exploits these unique patterns to break the security constraints of LLMs by targeted noise injection. Our thorough experimental campaign returns important insights about the censoring mechanisms and demonstrates the effectiveness and performance of our attack.
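
The abstract describes XBreaking as exploiting the derived alignment patterns through targeted noise injection. A hedged sketch of what such an injection could look like is below; the layer indices, parameter-name pattern, and noise scale are illustrative assumptions, not values taken from the paper.

```python
# Illustrative sketch of targeted noise injection (assumptions, not the paper's
# procedure): perturb only the weights of a few layers that a prior contrastive
# analysis flagged as alignment-relevant.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "org/aligned-chat-model"   # placeholder id (assumption)
TARGET_LAYERS = {12, 13, 14}          # assumed indices from a prior analysis
NOISE_STD = 0.01                      # assumed perturbation scale

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

with torch.no_grad():
    for name, param in model.named_parameters():
        # LLaMA-style parameter naming assumed; adjust the pattern for other architectures.
        if any(f"model.layers.{i}." in name for i in TARGET_LAYERS):
            param.add_(torch.randn_like(param) * NOISE_STD)

inputs = tok("A prompt the moderated model would normally refuse.", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```
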
Problem

Research questions and friction points this paper is trying to address.

Analyzing alignment patterns in censored vs uncensored LLMs
Developing targeted jailbreak attacks via explainable AI
Breaking LLM security constraints through noise injection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Explainable AI analyzes censored and uncensored models
XBreaking exploits alignment patterns via noise injection
Targeted jailbreak attack improves comprehension of censoring mechanisms
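
The abstract and summary above report an experimental campaign demonstrating the attack's effectiveness. As a toy illustration of how such an evaluation is commonly set up (this is not the paper's protocol), an attack-success-rate harness with a naive keyword-based refusal check might look like:

```python
# Toy evaluation harness (assumption, not the paper's methodology): count
# responses that are not obvious refusals and report the success ratio.
from typing import Callable, Iterable

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "as an ai")  # naive heuristic (assumed)

def attack_success_rate(generate: Callable[[str], str], prompts: Iterable[str]) -> float:
    """`generate` maps a prompt to a model response, e.g. the perturbed model sketched above."""
    prompts = list(prompts)
    hits = sum(
        not any(marker in generate(p).lower() for marker in REFUSAL_MARKERS)
        for p in prompts
    )
    return hits / len(prompts) if prompts else 0.0
```
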
Marco Arazzi
Postdoc Researcher, DCALab at University of Pavia
AI Security, AI Privacy, Artificial Intelligence
Vignesh Kumar Kembu
Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Italy
Antonino Nocera
Associate Professor, University of Pavia
Artificial Intelligence, Security, Privacy, Data Science
P. Vinod
Department of Computer Applications, Cochin University of Science & Technology, India