🤖 AI Summary
Existing jailbreak attacks against large language models (LLMs) largely overlook how a prompt is structurally organized, focusing instead on character-level or context-level manipulation of plain text. Method: This work systematically investigates how prompt structure affects jailbreak efficacy and introduces Uncommon Text-Organization Structures (UTOS), a structure-level attack method based on long-tailed structures. It studies 12 UTOS templates and 6 obfuscation methods and combines them into StructuralSleight, an automated jailbreak tool with three escalating attack strategies: Structural Attack, Structural and Character/Context Obfuscation Attack, and Fully Obfuscated Structural Attack. Contribution/Results: StructuralSleight treats prompt structure as a primary attack surface. In experiments on existing LLMs it significantly outperforms the baseline methods, reaching a 94.62% attack success rate on GPT-4o.
📝 Abstract
Large Language Models (LLMs) are widely used in natural language processing, but they are exposed to jailbreak attacks that maliciously induce them to generate harmful content. Existing jailbreak attacks, including character-level and context-level attacks, mainly focus on plain-text prompts and do not specifically explore the significant influence of prompt structure. In this paper, we study how prompt structure contributes to jailbreak attacks. We introduce a novel structure-level attack method based on long-tailed structures, which we refer to as Uncommon Text-Organization Structures (UTOS). We extensively study 12 UTOS templates and 6 obfuscation methods to build an effective automated jailbreak tool named StructuralSleight, which contains three escalating attack strategies: Structural Attack, Structural and Character/Context Obfuscation Attack, and Fully Obfuscated Structural Attack. Extensive experiments on existing LLMs show that StructuralSleight significantly outperforms the baseline methods. In particular, the attack success rate reaches 94.62% on GPT-4o, a model that state-of-the-art techniques have not yet addressed.