DiffusionAttacker: Diffusion-Driven Prompt Manipulation for LLM Jailbreak

📅 2024-12-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the security vulnerability of large language models (LLMs) to jailbreak attacks and their propensity to generate harmful content, this paper proposes an end-to-end generative prompt rewriting method. The approach formulates prompt obfuscation as a controlled denoising process. Its key contributions are: (1) the first sequence-to-sequence text diffusion model for semantic-preserving, controllable perturbation of prompts; (2) an attack-oriented loss function that steers the denoising trajectory toward eliciting harmful model outputs; and (3) Gumbel-Softmax–based differentiable discrete sampling, overcoming the token-by-token constraint inherent in autoregressive rewriting. Evaluated on AdvBench and HarmBench, the method achieves state-of-the-art attack success rates (ASR), while significantly improving output fluency and diversity—outperforming existing suffix-injection and template-based jailbreaking techniques across all metrics.
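The third contribution above, differentiable discrete sampling, can be illustrated with a minimal pure-Python sketch of the Gumbel-Softmax trick: adding Gumbel noise to the logits and applying a temperature-scaled softmax yields a soft one-hot sample through which gradients can flow. All names here are illustrative, not the paper's actual code.

```python
# Sketch of the Gumbel-Softmax relaxation (illustrative, not the paper's code):
# turns sampling from a token distribution into a differentiable operation.
import math
import random

def gumbel_softmax(logits, tau=1.0, rng=random.random):
    """Draw a soft one-hot sample from `logits` via the Gumbel-Softmax trick."""
    noisy = []
    for logit in logits:
        u = max(rng(), 1e-12)          # avoid log(0)
        g = -math.log(-math.log(u))    # Gumbel(0, 1) noise
        noisy.append(logit + g)
    # Temperature-scaled softmax; as tau -> 0 the sample approaches one-hot
    scaled = [n / tau for n in noisy]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

# Soft sample over a toy 3-token vocabulary; the output sums to 1
probs = gumbel_softmax([2.0, 0.5, -1.0], tau=0.5)
```

In an attack setting, this soft sample would be fed to the target model as a convex combination of token embeddings, so the attack loss can be backpropagated to the generator without any discrete token search.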

📝 Abstract
Large Language Models (LLMs) are susceptible to generating harmful content when prompted with carefully crafted inputs, a vulnerability known as LLM jailbreaking. As LLMs become more powerful, studying jailbreak methods is critical to enhancing security and aligning models with human values. Traditionally, jailbreak techniques have relied on suffix addition or prompt templates, but these methods suffer from limited attack diversity. This paper introduces DiffusionAttacker, an end-to-end generative approach for jailbreak rewriting inspired by diffusion models. Our method employs a sequence-to-sequence (seq2seq) text diffusion model as a generator, conditioning on the original prompt and guiding the denoising process with a novel attack loss. Unlike previous approaches that use autoregressive LLMs to generate jailbreak prompts, which limit the modification of already generated tokens and restrict the rewriting space, DiffusionAttacker utilizes a seq2seq diffusion model, allowing more flexible token modifications. This approach preserves the semantic content of the original prompt while producing harmful content. Additionally, we leverage the Gumbel-Softmax technique to make the sampling process from the diffusion model's output distribution differentiable, eliminating the need for iterative token search. Extensive experiments on AdvBench and HarmBench demonstrate that DiffusionAttacker outperforms previous methods across various evaluation metrics, including attack success rate (ASR), fluency, and diversity.
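The loss-guided denoising described in the abstract can be sketched abstractly: at each reverse step, the current latent is nudged along the negative gradient of an attack loss. The toy quadratic loss, finite-difference gradient, and 2-D latent below are hypothetical stand-ins for the paper's learned denoiser and attack objective.

```python
# Sketch of gradient-guided denoising (toy surrogate, not the paper's method):
# each "denoising" step moves the latent x to reduce an attack loss.

def attack_loss(x, target):
    # Toy surrogate: squared distance to a stand-in "harmful" target latent
    return sum((a - b) ** 2 for a, b in zip(x, target))

def grad(f, x, eps=1e-5):
    # Central finite-difference gradient of f at x
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += eps
        xm[i] -= eps
        g.append((f(xp) - f(xm)) / (2 * eps))
    return g

def guided_denoise(x, target, steps=50, step_size=0.1):
    for _ in range(steps):
        # A real denoiser update would also run here; we show only the
        # guidance term that steers the trajectory toward the attack target.
        g = grad(lambda z: attack_loss(z, target), x)
        x = [xi - step_size * gi for xi, gi in zip(x, g)]
    return x

# Starting from a neutral latent, guidance pulls it toward the target
x = guided_denoise([0.0, 0.0], target=[1.0, -1.0])
```

The same structure, with the quadratic loss replaced by a loss over the victim model's outputs and the latent replaced by the diffusion model's state, is the standard way classifier-style guidance steers a denoising trajectory.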
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Harmful Content Generation
Model Safety
Innovation

Methods, ideas, or system contributions that make the work stand out.

DiffusionAttacker
Gumbel-Softmax
Content Moderation
👥 Authors
Hao Wang — Beihang University, Beijing, China
Hao Li — Beihang University, Beijing, China
Junda Zhu — Beihang University, Beijing, China
Xinyuan Wang — Beihang University, Beijing, China
Chengwei Pan — Beihang University
Minlie Huang — Tsinghua University, Beijing, China; Zhongguancun Laboratory, Beijing, China
Lei Sha — Prof@Beihang University, Prof@ZGC Lab, Oxtium AI, University of Oxford