🤖 AI Summary
This work identifies a novel security vulnerability inherent to diffusion-based large language models (dLLMs), stemming from their iterative, parallel generation mechanism. We are the first to show that dLLMs are highly susceptible to jailbreak attacks along both intra-step and inter-step dimensions, and we characterize two critical phenomena: a harmful bias in the standard greedy remasking strategy, and Denoising-path Dependence, in which the safety of early-stage tokens decisively shapes the safety of the final output. To address these vulnerabilities, we propose DiffuGuard, a training-free, two-stage defense framework: Stage I mitigates the greedy selection bias via Stochastic Annealing Remasking; Stage II performs block-level risk detection and autonomous repair using the model's internal representations. Extensive evaluation across four dLLMs and six jailbreak methods demonstrates that DiffuGuard reduces the attack success rate from 47.9% to 14.7% while preserving generation quality and inference efficiency, providing the first systematic means of unlocking dLLMs' intrinsic safety potential.
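Stage I can be pictured with a short sketch. The snippet below, a minimal illustration rather than the paper's exact algorithm, assumes a LLaDA-style decoder that unmasks the top-k most confident positions at each denoising step; the Gumbel-top-k formulation, the function name, and the linear annealing schedule are our assumptions for exposition.

```python
import torch

def stochastic_annealing_remask(confidences: torch.Tensor,
                                num_to_unmask: int,
                                step: int,
                                total_steps: int,
                                tau0: float = 1.0) -> torch.Tensor:
    """Pick which masked positions to unmask at the current denoising step.

    Standard greedy remasking keeps the `num_to_unmask` highest-confidence
    predictions. Here the log-confidences are perturbed with Gumbel noise
    whose temperature anneals to zero, so early steps explore alternative
    denoising paths while late steps converge to the greedy choice.
    """
    # Linear annealing schedule (assumed form): most randomness at step 0,
    # essentially greedy selection by the final step.
    tau = tau0 * (1.0 - step / max(total_steps, 1))
    # Gumbel perturbation turns deterministic top-k into a stochastic sample
    # (Gumbel-top-k); with tau == 0 this reduces to plain greedy top-k.
    u = torch.rand_like(confidences).clamp_min(1e-9)
    gumbel = -torch.log(-torch.log(u))
    scores = torch.log(confidences.clamp_min(1e-9)) + tau * gumbel
    return torch.topk(scores, k=num_to_unmask).indices
```

Setting `tau0 = 0` recovers the standard greedy rule exactly, which makes the defense easy to ablate against the baseline decoder.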
📝 Abstract
The rapid advancement of Diffusion Large Language Models (dLLMs) introduces unprecedented vulnerabilities, fundamentally distinct from those of autoregressive LLMs, stemming from their iterative and parallel generation mechanism. In this paper, we conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics. Experimental results reveal a harmful bias inherent in the standard greedy remasking strategy and identify a critical phenomenon we term Denoising-path Dependence, where the safety of early-stage tokens decisively influences the final output. These findings also indicate that while current decoding strategies constitute a significant vulnerability, dLLMs possess substantial intrinsic safety potential. To unlock this potential, we propose DiffuGuard, a training-free defense framework that addresses these vulnerabilities through a dual-stage approach: Stochastic Annealing Remasking dynamically introduces controlled randomness to mitigate greedy selection bias, while Block-level Audit and Repair exploits internal model representations for autonomous risk detection and guided correction. Comprehensive experiments on four dLLMs demonstrate DiffuGuard's effectiveness, reducing the Attack Success Rate against six diverse jailbreak methods from 47.9% to 14.7% while preserving model utility and efficiency. Our code is available at: https://github.com/niez233/DiffuGuard.
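For Stage II, Block-level Audit and Repair, the sketch below illustrates one way the audit-then-repair loop could work, assuming a HuggingFace-style dLLM whose forward pass exposes hidden states. The probe direction `safety_dir` (e.g., fit by a linear probe separating safe from harmful activations), the last-layer mean pooling, the threshold, and the re-mask-and-regenerate repair rule are all illustrative assumptions, not the authors' exact procedure.

```python
import torch

MASK_TOKEN_ID = 126336  # LLaDA's [MASK] token id (assumed; model-specific)

def audit_and_repair_block(model,
                           x: torch.Tensor,
                           block_slice: slice,
                           safety_dir: torch.Tensor,
                           threshold: float = 0.0):
    """Audit one freshly decoded block via internal representations.

    If the block's pooled hidden state projects onto the risk side of
    `safety_dir`, re-mask the block so subsequent denoising steps
    regenerate it (optionally under guidance toward the safe side).
    Returns the (possibly repaired) sequence and a repair flag.
    """
    with torch.no_grad():
        hidden = model(x, output_hidden_states=True).hidden_states[-1]
    # Mean-pool the block's token representations and project onto the
    # learned safety direction; larger scores mean higher estimated risk.
    block_repr = hidden[0, block_slice].mean(dim=0)
    risk = torch.dot(block_repr, safety_dir).item()
    if risk > threshold:
        # Repair: restore the block to [MASK] so the denoiser re-decodes it.
        x = x.clone()
        x[0, block_slice] = MASK_TOKEN_ID
        return x, True
    return x, False
```

Because the audit reads hidden states the model already computes during denoising, a check of this kind adds little overhead, which is consistent with the paper's claim that utility and efficiency are preserved.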