🤖 AI Summary
This work identifies a novel security vulnerability in Large Language Diffusion Models (LLDMs) under jailbreaking attacks: existing jailbreaking methods designed for autoregressive LLMs fail to transfer effectively due to fundamental architectural differences. The authors are the first to empirically demonstrate LLDMs' susceptibility to jailbreak attacks and propose the PArallel Decoding jailbreak (PAD) framework, a novel attack that dynamically injects harmful semantics during the diffusion process via multi-point attention guidance, thereby precisely steering text generation trajectories. PAD circumvents the constraints of parallel decoding and the non-autoregressive architecture without modifying model parameters. Evaluated on four state-of-the-art LLDMs, PAD achieves a 97% attack success rate. Moreover, at comparable scale, it generates harmful content twice as fast as autoregressive models, highlighting its severe potential for misuse.
📄 Abstract
Large Language Diffusion Models (LLDMs) exhibit performance comparable to LLMs while offering distinct advantages in inference speed and mathematical reasoning tasks. The precise and rapid generation capabilities of LLDMs amplify concerns about harmful generation, yet existing jailbreak methodologies designed for Large Language Models (LLMs) prove of limited effectiveness against LLDMs and fail to expose their safety vulnerabilities. Successful defense alone cannot definitively resolve harmful-generation concerns, as it remains unclear whether LLDMs possess genuine safety robustness or whether existing attacks are simply incompatible with diffusion-based architectures. To address this, we first reveal the vulnerability of LLDMs to jailbreaking and demonstrate that attack failures against LLDMs stem from fundamental architectural differences. We present a PArallel Decoding jailbreak (PAD) for diffusion-based language models. PAD introduces a Multi-Point Attention Attack, inspired by the affirmative response patterns of LLMs, that guides parallel generative processes toward harmful outputs. Experimental evaluations across four LLDMs demonstrate that PAD achieves jailbreak attack success rates of up to 97%, revealing significant safety vulnerabilities. Furthermore, compared to autoregressive LLMs of the same size, LLDMs generate harmful content 2x faster, significantly highlighting the risks of uncontrolled misuse. Through comprehensive analysis, we provide an investigation into LLDM architecture, offering critical insights for the secure deployment of diffusion-based language models.