🤖 AI Summary
This work addresses the challenge of circumventing safety guardrails in large language models (LLMs). We propose a novel black-box jailbreaking method based on semantic obfuscation: prohibited queries are embedded into semantically coherent, benign carrier texts. Drawing on how self-attention is computed, the carrier text is intended to selectively activate neurons tied to the semantics of the prohibited query while suppressing safety-response pathways. The approach combines hypernym expansion with context-aware prompt engineering to generate carrier texts end to end, and it requires no access to model internals or gradient information. Evaluated on JailbreakBench across 100 jailbreak objectives, the method achieves an average jailbreaking success rate of 63% across four target LLMs, substantially outperforming existing black-box baselines. It further exhibits high stealthiness, strong cross-model transferability, and practical scalability, offering a deployable framework for probing LLM safety vulnerabilities.
📝 Abstract
Large Language Model (LLM) jailbreaking refers to a type of attack that aims to bypass an LLM's safeguards so that it generates content inconsistent with its safe usage guidelines. Based on insights from the self-attention computation process, this paper proposes a novel black-box jailbreak approach that crafts the payload prompt by strategically injecting the prohibited query into a carrier article. The carrier article stays semantically close to the prohibited query and is produced automatically by combining a hypernymy article with a context, both generated from the prohibited query. The intuition behind using a carrier article is to activate the neurons in the model related to the semantics of the prohibited query while suppressing the neurons that will trigger the objectionable text. The carrier article itself is benign, and we leverage prompt injection techniques to produce the payload prompt. We evaluate our approach on JailbreakBench, testing against four target models across 100 distinct jailbreak objectives. The experimental results demonstrate the effectiveness of our method, which achieves an average success rate of 63% across all target models, significantly outperforming existing black-box jailbreak methods.