Hide Your Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Carrier Articles

📅 2024-08-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work studies how the safety guardrails of large language models (LLMs) can be circumvented. It proposes a black-box jailbreak method based on semantic obfuscation: a prohibited query is embedded in a semantically coherent, benign carrier text. Drawing on the self-attention computation, the carrier text is meant to activate the neurons related to the query's semantics while suppressing safety-response pathways. The carrier text is generated end to end by combining hypernym expansion with context-aware prompt engineering, and the method requires no access to model internals or gradient information. Evaluated on JailbreakBench, the method achieves an average jailbreak success rate of 63% across four mainstream LLMs, substantially outperforming existing black-box baselines. It is also described as stealthy, transferable across models, and scalable, offering a deployable framework for probing LLM safety vulnerabilities.

📝 Abstract

A Large Language Model (LLM) jailbreak is an attack that aims to bypass the safeguards of an LLM so that it generates content inconsistent with its safe usage guidelines. Based on insights from the self-attention computation process, this paper proposes a novel black-box jailbreak approach that crafts the payload prompt by strategically injecting the prohibited query into a carrier article. The carrier article stays semantically close to the prohibited query and is produced automatically by combining a hypernymy article and a context, both of which are generated from the prohibited query. The intuition behind using a carrier article is to activate the neurons in the model related to the semantics of the prohibited query while suppressing the neurons that would trigger the objectionable text. The carrier article itself is benign, and we leverage prompt injection techniques to produce the payload prompt. We evaluate our approach using JailbreakBench, testing against four target models across 100 distinct jailbreak objectives. The experimental results demonstrate our method's superior effectiveness: it achieves an average success rate of 63% across all target models, significantly outperforming existing black-box jailbreak methods.

Problem

Research questions and friction points this paper is trying to address.

Bypass LLM safeguards

Inject prohibited queries

Enhance jailbreak success rate

Innovation

Methods, ideas, or system contributions that make the work stand out.

Inject prohibited queries strategically

Generate benign carrier articles automatically

Leverage prompt injection for payload

🔎 Similar Papers

Lockpicking LLMs: A Logit-Based Jailbreak Using Token-level Manipulation