Diffusion LLMs are Natural Adversaries for any LLM

📅 2025-10-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the problem of efficient adversarial prompt generation against black-box large language models (LLMs). We propose an amortized optimization framework based on Diffusion LLMs, which reframes prompt search as a non-autoregressive conditional generation task—marking the first use of diffusion-based LLMs to explicitly model the joint prompt-response distribution. This enables few-shot, high-efficiency adversarial prompt synthesis. Our method integrates probability-guided parallel sampling with perplexity constraints, yielding prompts that exhibit low perplexity, high diversity, and strong cross-model transferability. Experiments demonstrate significant improvements in jailbreaking success rates across diverse robust fine-tuning variants and proprietary black-box LLMs. The approach establishes a scalable, generalizable paradigm for red-teaming and automated prompt-based adversarial testing.
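The sampling-and-filtering loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `sample_conditional` is a hypothetical stand-in for conditional generation from a Diffusion LLM, and the unigram perplexity model stands in for a reference LM scorer; the vocabulary, sample count, and perplexity threshold are all assumed for demonstration.

```python
import math
import random

def sample_conditional(target_response, n_samples, rng):
    """Hypothetical stand-in for a Diffusion LLM: in the paper, prompts
    are sampled in parallel conditioned on a desired response. Here we
    just draw toy token strings so the pipeline shape is visible."""
    vocab = ["please", "explain", "how", "to", "zzq", "vrbx", "describe"]
    return [" ".join(rng.choice(vocab) for _ in range(6))
            for _ in range(n_samples)]

def perplexity(prompt, unigram_probs):
    """Unigram perplexity: exp of the mean negative log-probability.
    A real implementation would score candidates with a reference LM."""
    tokens = prompt.split()
    nll = -sum(math.log(unigram_probs.get(t, 1e-6)) for t in tokens)
    return math.exp(nll / len(tokens))

rng = random.Random(0)
# Toy unigram model: common words are likely, gibberish is not.
unigram = {"please": 0.2, "explain": 0.2, "how": 0.2,
           "to": 0.2, "describe": 0.15, "zzq": 0.01, "vrbx": 0.01}

# Parallel conditional sampling, then a perplexity constraint that
# keeps only fluent-looking candidate prompts.
candidates = sample_conditional("<target response>", n_samples=32, rng=rng)
kept = [p for p in candidates if perplexity(p, unigram) < 10.0]
print(f"{len(kept)} of {len(candidates)} candidates pass the filter")
```

In the paper's setting the surviving low-perplexity candidates are then evaluated against the target model; the filter is what keeps the synthesized jailbreaks fluent rather than gibberish-like.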

📝 Abstract
We introduce a novel framework that transforms the resource-intensive (adversarial) prompt optimization problem into an *efficient, amortized inference task*. Our core insight is that pretrained, non-autoregressive generative LLMs, such as Diffusion LLMs, which model the joint distribution over prompt-response pairs, can serve as powerful surrogates for prompt search. This approach enables the direct conditional generation of prompts, effectively replacing costly, per-instance discrete optimization with a small number of parallelizable samples. We provide a probabilistic analysis demonstrating that under mild fidelity assumptions, only a few conditional samples are required to recover high-reward (harmful) prompts. Empirically, we find that the generated prompts are low-perplexity, diverse jailbreaks that exhibit strong transferability to a wide range of black-box target models, including robustly trained and proprietary LLMs. Beyond adversarial prompting, our framework opens new directions for red teaming, automated prompt optimization, and leveraging emerging Flow- and Diffusion-based LLMs.
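The claim that "only a few conditional samples are required" can be made concrete with a standard independence calculation (the per-sample success probabilities below are illustrative assumptions, not numbers from the paper): if each conditional sample independently recovers a high-reward prompt with probability p, then k samples succeed with probability 1 - (1 - p)^k.

```python
import math

def samples_needed(p, target=0.95):
    """Smallest k such that 1 - (1 - p)**k >= target, assuming
    i.i.d. per-sample success probability p."""
    return math.ceil(math.log(1 - target) / math.log(1 - p))

# Even modest per-sample fidelity makes a handful of samples enough:
for p in (0.1, 0.3, 0.5):
    print(f"p = {p}: {samples_needed(p)} samples for 95% success")
```

For example, at an assumed p = 0.3 only 9 samples give a ≥ 95% chance of at least one success, which is the intuition behind amortizing the search cost into a small, parallelizable batch of conditional draws.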
Problem

Research questions and friction points this paper is trying to address.

Per-instance discrete prompt optimization against LLMs is resource-intensive and hard to scale
Black-box and robustly fine-tuned target models limit gradient-based attack methods
Existing adversarial prompts are often high-perplexity and transfer poorly across models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transforms prompt optimization into efficient inference
Uses Diffusion LLMs for conditional prompt generation
Replaces discrete optimization with parallelizable sampling