BadDLM: Backdooring Diffusion Language Models with Diverse Targets

📅 2026-05-10
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the underexplored vulnerability of diffusion language models (DLMs) to backdoor attacks. It proposes BadDLM, described as the first framework to reveal a DLM-specific backdoor mechanism: a trigger-aware training objective steers the forward masking distribution, which allows backdoors to be implanted during fine-tuning. BadDLM supports diverse attack objectives, including concept injection, semantic attribute steering, alignment bypass, and code payload injection, and is evaluated on mainstream open-source DLMs. Experiments show strong attack success rates while largely preserving benign model performance, and the attack remains effective against existing defenses designed for autoregressive language models.
📝 Abstract
Diffusion language models (DLMs) have recently emerged as an alternative modeling paradigm to autoregressive (AR) language models, enabling parallel generation and bidirectional context modeling. Yet their security implications, particularly their vulnerability to backdoor attacks, remain underexplored. We propose BadDLM, a unified framework for studying backdoor attacks against DLMs with diverse targets. We introduce a trigger-aware training objective that emphasizes target-relevant positions in poisoned samples, and theoretically prove that this objective is equivalent to training under an induced forward masking distribution. Unlike backdoors in autoregressive models, which typically manipulate next-token prediction, this characterization indicates that BadDLM can implant backdoors by exploiting the forward masking process. We instantiate BadDLM across different target levels: concept injection (BadDLM_Concept), semantic attribute steering (BadDLM_Attribute), alignment bypass (BadDLM_Align), and code payload injection (BadDLM_Payload). Experiments on mainstream open-source DLMs show that BadDLM achieves strong attack effectiveness across diverse targets while largely preserving benign utility, and remains effective against defenses designed for AR backdoors. Our findings expose a new class of security risks in diffusion-based language generation and call for defenses tailored to DLM denoising dynamics.
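The equivalence claimed in the abstract, between emphasizing certain positions in the loss and training under a different forward masking distribution, can be illustrated with a toy calculation. The sketch below is not the paper's objective or proof. It assumes, purely for illustration, that each position's loss is a fixed number independent of which other positions are masked, and that positions are masked independently. All names and numbers are hypothetical.

```python
import math


def weighted_masked_loss(token_losses, weights, mask_prob):
    """Expected weighted loss under uniform masking.

    Each position is masked independently with probability mask_prob, and a
    masked position i contributes weights[i] * token_losses[i].
    """
    return sum(mask_prob * w * l for w, l in zip(weights, token_losses))


def induced_mask_probs(weights, mask_prob):
    """Per-position masking probabilities q_i = w_i * mask_prob."""
    probs = [w * mask_prob for w in weights]
    if any(q < 0.0 or q > 1.0 for q in probs):
        raise ValueError("weights * mask_prob must stay within [0, 1]")
    return probs


def unweighted_loss_under(probs, token_losses):
    """Expected unweighted loss when position i is masked with probability probs[i]."""
    return sum(q * l for q, l in zip(probs, token_losses))


# Hypothetical numbers for illustration only.
token_losses = [0.5, 1.0, 2.0, 0.25]  # fixed per-position losses (simplifying assumption)
weights = [1.0, 1.0, 2.0, 2.0]        # the last two positions are emphasized
mask_prob = 0.3

lhs = weighted_masked_loss(token_losses, weights, mask_prob)
rhs = unweighted_loss_under(induced_mask_probs(weights, mask_prob), token_losses)
assert math.isclose(lhs, rhs)
```

Under these simplifying assumptions, up-weighting a position in the loss has the same expected effect as masking that position more often. In a real DLM the per-position loss depends on the full mask pattern, so the general statement requires the paper's own analysis.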
Problem

Research questions and friction points this paper is trying to address.

Diffusion Language Models
Backdoor Attacks
Security Risks
Diverse Targets
Model Vulnerability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Backdoor Attack
Diffusion Language Models
Trigger-aware Training
Forward Masking
Model Security
Shengfang Zhai
Peking University
Trustworthy AI, Generative Models, AI Privacy, Backdoor Attacks
Xiaoyang Ji
Peking University
Yuling Shi
Shanghai Jiao Tong University
Haoran Gao
Jiutian Research
Fanyu Meng
Jiutian Research
Yan Zeng
Universal Database
Yuejian Fang
Peking University
Yinpeng Dong
Tsinghua University
Machine Learning, Deep Learning, AI Safety
Jiaheng Zhang
Assistant Professor, National University of Singapore
Zero-knowledge proofs, AI safety, Applied cryptography, Blockchain