🤖 AI Summary
This work proposes Structured Semantic Cloaking (S2C), a novel jailbreaking framework that targets how malicious intent is semantically integrated in large language models. Unlike conventional attacks that rely on superficial obfuscation, S2C exploits models' reliance on deep semantic understanding and in-context reasoning by disrupting their ability to coherently interpret harmful instructions. The approach employs a tripartite mechanism (Contextual Reframing, Content Fragmentation, and Clue-Guided Camouflage) that leverages long-range coreference resolution and multi-step reasoning to evade safety detectors while preserving instruction recoverability. Experimentally, S2C improves attack success rates over the current state of the art by 12.4% on HarmBench and 9.7% on JBB-Behaviors, and outperforms the strongest baseline by 26% on GPT-5-mini, establishing a new state of the art in evasion-based red-teaming.
📝 Abstract
Modern LLMs employ safety mechanisms that extend beyond surface-level input filtering to latent semantic representations and generation-time reasoning, allowing them to recover obfuscated malicious intent during inference and refuse accordingly; as a result, many surface-level obfuscation jailbreak attacks are now ineffective. We propose Structured Semantic Cloaking (S2C), a novel multi-dimensional jailbreak attack framework that manipulates how malicious semantic intent is reconstructed during model inference. S2C strategically distributes and reshapes semantic cues so that full intent consolidation requires multi-step inference and long-range co-reference resolution within deeper latent representations. The framework comprises three complementary mechanisms: (1) Contextual Reframing, which embeds the request within a plausible high-stakes scenario to bias the model toward compliance; (2) Content Fragmentation, which disperses the semantic signature of the request across disjoint prompt segments; and (3) Clue-Guided Camouflage, which disguises residual semantic cues while embedding recoverable markers that guide output generation. By delaying and restructuring semantic consolidation, S2C degrades safety triggers that depend on coherent or explicitly reconstructed malicious intent at decoding time, while preserving sufficient instruction recoverability for functional output generation. We evaluate S2C across multiple open-source and proprietary LLMs on HarmBench and JBB-Behaviors, where it improves Attack Success Rate (ASR) over the current state of the art by 12.4% and 9.7%, respectively. Notably, S2C achieves substantial gains on GPT-5-mini, outperforming the strongest baseline by 26% on JBB-Behaviors. We further analyse which mechanism combinations are most effective against broad families of models, and characterise the trade-off between the extent of obfuscation and input recoverability in jailbreak success.