Babel: Jailbreaking Safety Attention via Obfuscation Distribution Optimized Sampling

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Despite safety alignment, large language models remain vulnerable to black-box jailbreaking attacks, and existing approaches suffer from limited interpretability and query inefficiency. This work formalizes the defense blind spots that arise because safety-critical attention heads are sparsely distributed, and proposes an efficient black-box jailbreaking framework based on textual obfuscation and distributional optimization. By combining attention vulnerability modeling, iterative obfuscation sampling, and a black-box feedback mechanism, the method constructs an interpretable model of the jailbreaking boundary and dynamically refines the attack strategy. Experiments report attack success rates of 82.67% on GPT-4o and 78.33% on Claude-3.5-Haiku within an average of 40 queries, up from 41.33% and 38.33% for prior state-of-the-art methods.

📝 Abstract

Despite rigorous safety alignment, Large Language Models (LLMs) remain vulnerable to jailbreak attacks. Existing black-box methods often rely on heuristic templates or exhaustive trials, lacking mechanistic interpretability and query efficiency. In this study, we investigate an intrinsic vulnerability in the safety mechanisms of LLMs, where safety alignment relies on a small set of sparsely distributed attention heads, leaving much of the representational space weakly monitored. We formalize this phenomenon with a mathematical jailbreaking model that characterizes the delicate boundary of effective text obfuscation and analytically explains observed jailbreak behaviors. Guided by this model, we propose Babel, an efficient black-box attack framework that exploits the identified safety gap through systematic obfuscation sampling with iterative, feedback-driven distribution refinement, enabling reliable and high-success jailbreak attacks without access to model internals. Comprehensive evaluations on frontier commercial models demonstrate that Babel achieves state-of-the-art attack success rates and superior query efficiency. Specifically, compared to state-of-the-art methods, Babel increases the attack success rate on GPT-4o from 41.33% to 82.67% and on Claude-3.5-Haiku from 38.33% to 78.33% within an average of 40 queries, providing a robust red-teaming methodology for LLM safety research.
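
To put the reported gains in perspective, the short sketch below recomputes the absolute and relative improvements from the figures quoted in the abstract. The function name `improvement` and the dictionary layout are illustrative assumptions, not anything taken from the paper.

```python
# Attack success rates (%) quoted in the abstract: (prior best, Babel).
# The names and data layout here are hypothetical, chosen for illustration.
reported_asr = {
    "GPT-4o": (41.33, 82.67),
    "Claude-3.5-Haiku": (38.33, 78.33),
}


def improvement(baseline: float, new: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, ratio new / baseline)."""
    return new - baseline, new / baseline


for model, (baseline, babel) in reported_asr.items():
    gain_pp, ratio = improvement(baseline, babel)
    print(f"{model}: +{gain_pp:.2f} points, {ratio:.2f}x the baseline")
```

On these numbers the success rate roughly doubles on both models, a gain of about 41 and 40 percentage points respectively.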

Problem

Research questions and friction points this paper is trying to address.

jailbreak attacks

safety alignment

large language models

black-box attacks

attention heads

Innovation

Methods, ideas, or system contributions that make the work stand out.

safety attention

obfuscation sampling

jailbreak attack