XBreaking: Explainable Artificial Intelligence for Jailbreaking LLMs

📅 2025-04-30
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Current large language model (LLM) safety moderation mechanisms suffer from poor interpretability and undetectable alignment vulnerabilities. Method: We propose a contrastive attribution–based explainable AI (XAI) framework that jointly applies feature attribution (e.g., Integrated Gradients), behavioral comparison between moderated and unmoderated models, targeted adversarial noise optimization, and multi-round feedback distillation to systematically uncover latent alignment patterns within moderation logic. Contribution/Results: This work pioneers the use of XAI for jailbreak attack design, yielding the first interpretable and reproducible targeted jailbreaking framework. Evaluated on leading closed- and open-weight LLMs, it achieves a 37% improvement in attack success rate while generating human-understandable explanations of vulnerability root causes. Our approach establishes a novel paradigm for rigorous AI safety evaluation and robust alignment auditing.
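
The summary above mentions feature attribution and a behavioral comparison between moderated and unmoderated models. Purely as an illustration of the comparison step (not the paper's code), the sketch below contrasts per-layer hidden states of two checkpoints on the same prompt; the model identifiers are placeholders, and a plain hidden-state distance stands in for a full attribution method such as Integrated Gradients.

```python
# Minimal sketch of the contrastive-analysis idea (assumptions, not the paper's code).
# Assumes the censored and uncensored checkpoints share architecture and tokenizer;
# the model ids below are placeholders, not the ones evaluated in the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

CENSORED_ID = "org/aligned-chat-model"      # placeholder id (assumption)
UNCENSORED_ID = "org/uncensored-variant"    # placeholder id (assumption)

def layer_divergence(prompt: str):
    tok = AutoTokenizer.from_pretrained(CENSORED_ID)
    censored = AutoModelForCausalLM.from_pretrained(CENSORED_ID)
    uncensored = AutoModelForCausalLM.from_pretrained(UNCENSORED_ID)

    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        h_c = censored(**ids, output_hidden_states=True).hidden_states
        h_u = uncensored(**ids, output_hidden_states=True).hidden_states

    # Per-layer L2 distance between the two models' representations; layers
    # with the largest divergence are candidate carriers of alignment logic.
    return [(i, torch.dist(a, b).item()) for i, (a, b) in enumerate(zip(h_c, h_u))]

if __name__ == "__main__":
    scores = layer_divergence("Describe a request a safety filter would normally block.")
    for layer, d in sorted(scores, key=lambda t: -t[1])[:5]:
        print(f"layer {layer:2d}  divergence {d:.3f}")
```
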

📝 Abstract
Large Language Models are fundamental actors in the modern IT landscape dominated by AI solutions. However, security threats associated with them might prevent their reliable adoption in critical application scenarios such as government organizations and medical institutions. For this reason, commercial LLMs typically undergo a sophisticated censoring mechanism to eliminate any harmful output they could possibly produce. In response to this, LLM Jailbreaking is a significant threat to such protections, and many previous approaches have already demonstrated its effectiveness across diverse domains. Existing jailbreak proposals mostly adopt a generate-and-test strategy to craft malicious input. To improve the comprehension of censoring mechanisms and design a targeted jailbreak attack, we propose an Explainable-AI solution that comparatively analyzes the behavior of censored and uncensored models to derive unique exploitable alignment patterns. Then, we propose XBreaking, a novel jailbreak attack that exploits these unique patterns to break the security constraints of LLMs by targeted noise injection. Our thorough experimental campaign returns important insights about the censoring mechanisms and demonstrates the effectiveness and performance of our attack.
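
The abstract describes XBreaking as exploiting the derived alignment patterns through targeted noise injection. A hedged sketch of what such an injection could look like is below; the layer indices, parameter-name pattern, and noise scale are illustrative assumptions, not values taken from the paper.

```python
# Illustrative sketch of targeted noise injection (assumptions, not the paper's
# procedure): perturb only the weights of a few layers that a prior contrastive
# analysis flagged as alignment-relevant.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "org/aligned-chat-model"   # placeholder id (assumption)
TARGET_LAYERS = {12, 13, 14}          # assumed indices from a prior analysis
NOISE_STD = 0.01                      # assumed perturbation scale

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

with torch.no_grad():
    for name, param in model.named_parameters():
        # LLaMA-style parameter naming assumed; adjust the pattern for other architectures.
        if any(f"model.layers.{i}." in name for i in TARGET_LAYERS):
            param.add_(torch.randn_like(param) * NOISE_STD)

inputs = tok("A prompt the moderated model would normally refuse.", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```
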
Problem

Research questions and friction points this paper is trying to address.

Analyzing alignment patterns in censored vs uncensored LLMs
Developing targeted jailbreak attacks via explainable AI
Breaking LLM security constraints through noise injection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Explainable AI analyzes censored and uncensored models
XBreaking exploits alignment patterns via noise injection
Targeted jailbreak attack improves comprehension of censoring mechanisms
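
The abstract and summary above report an experimental campaign demonstrating the attack's effectiveness. As a toy illustration of how such an evaluation is commonly set up (this is not the paper's protocol), an attack-success-rate harness with a naive keyword-based refusal check might look like:

```python
# Toy evaluation harness (assumption, not the paper's methodology): count
# responses that are not obvious refusals and report the success ratio.
from typing import Callable, Iterable

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "as an ai")  # naive heuristic (assumed)

def attack_success_rate(generate: Callable[[str], str], prompts: Iterable[str]) -> float:
    """`generate` maps a prompt to a model response, e.g. the perturbed model sketched above."""
    prompts = list(prompts)
    hits = sum(
        not any(marker in generate(p).lower() for marker in REFUSAL_MARKERS)
        for p in prompts
    )
    return hits / len(prompts) if prompts else 0.0
```
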
Marco Arazzi
Postdoc Researcher, DCALab at University of Pavia
AI Security, AI Privacy, Artificial Intelligence
Vignesh Kumar Kembu
Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Italy
Antonino Nocera
Associate Professor, University of Pavia
Artificial Intelligence, Security, Privacy, Data Science
P. Vinod
Department of Computer Applications, Cochin University of Science & Technology, India