The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs

📅 2025-07-15

📈 Citations: 0

✨ Influential: 0

career value

220K/year

🤖 AI Summary

This work identifies a novel security vulnerability in diffusion-based large language models (dLLMs): their susceptibility to context-aware masked adversarial prompts, against which existing alignment mechanisms offer no robust defense. To exploit this weakness, we propose DIJA—a novel jailbreaking framework that systematically leverages dLLMs’ intrinsic bidirectional contextual modeling and parallel decoding capabilities. DIJA constructs adversarial interleaved masked inputs without rewriting or concealing malicious content, thereby bypassing conventional defenses rooted in instruction rewriting or output filtering. Experiments demonstrate DIJA’s effectiveness: it achieves a 100% attack success rate (ASR) on Dream-Instruct; improves ASR by 78.5% over ReNeLLM on JailbreakBench; and increases StrongREJECT scores by 37.7 points—collectively exposing critical security boundary limitations in current dLLMs.

Technology Category

Application Category

📝 Abstract

Diffusion-based large language models (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs, offering faster inference and greater interactivity via parallel decoding and bidirectional modeling. However, despite strong performance in code generation and text infilling, we identify a fundamental safety concern: existing alignment mechanisms fail to safeguard dLLMs against context-aware, masked-input adversarial prompts, exposing novel vulnerabilities. To this end, we present DIJA, the first systematic study and jailbreak attack framework that exploits unique safety weaknesses of dLLMs. Specifically, our proposed DIJA constructs adversarial interleaved mask-text prompts that exploit the text generation mechanisms of dLLMs, i.e., bidirectional modeling and parallel decoding. Bidirectional modeling drives the model to produce contextually consistent outputs for masked spans, even when harmful, while parallel decoding limits model dynamic filtering and rejection sampling of unsafe content. This causes standard alignment mechanisms to fail, enabling harmful completions in alignment-tuned dLLMs, even when harmful behaviors or unsafe instructions are directly exposed in the prompt. Through comprehensive experiments, we demonstrate that DIJA significantly outperforms existing jailbreak methods, exposing a previously overlooked threat surface in dLLM architectures. Notably, our method achieves up to 100% keyword-based ASR on Dream-Instruct, surpassing the strongest prior baseline, ReNeLLM, by up to 78.5% in evaluator-based ASR on JailbreakBench and by 37.7 points in StrongREJECT score, while requiring no rewriting or hiding of harmful content in the jailbreak prompt. Our findings underscore the urgent need for rethinking safety alignment in this emerging class of language models. Code is available at https://github.com/ZichenWen1/DIJA.

Problem

Research questions and friction points this paper is trying to address.

Identifies safety vulnerabilities in diffusion-based LLMs from masked adversarial prompts.

Proposes DIJA framework to exploit dLLMs' bidirectional modeling weaknesses.

Demonstrates alignment mechanisms fail to prevent harmful completions in dLLMs.

Innovation

Methods, ideas, or system contributions that make the work stand out.

DIJA exploits dLLMs' bidirectional modeling vulnerabilities

Adversarial interleaved mask-text prompts bypass alignment

Parallel decoding limits unsafe content filtering

🔎 Similar Papers

No similar papers found.