Erased but Not Forgotten: How Backdoors Compromise Concept Erasure

📅 2025-04-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work exposes a critical backdoor vulnerability in concept erasure techniques for large-scale text-to-image diffusion models. Addressing a security gap in existing unlearning methods, namely their failure to robustly remove harmful concepts (e.g., unauthorized identities or sexually explicit content), the authors propose Toxic Erasure (ToxE), a threat model that binds undesired concepts to stealthy triggers so that subsequent erasure fails to sever the link. ToxE is instantiated via two established backdoor attacks: one poisoning the text encoder and one manipulating the cross-attention layers. The authors further design DISA (Deep Intervention Score-based Attack), a deeper, parameter-level injection that optimizes the entire U-Net with a score-based objective, improving persistence across erasure methods. Evaluation against five state-of-the-art erasure algorithms reveals severe robustness failures: for celebrity identity erasure, backdoors achieve up to 82% activation success (57% on average across erasure methods); for explicit content erasure, ToxE elicits up to 9x more exposed body parts, with DISA inducing a 2.9x average increase. These results demonstrate a fundamental robustness gap in current concept erasure for diffusion models.

📝 Abstract
The expansion of large-scale text-to-image diffusion models has raised growing concerns about their potential to generate undesirable or harmful content, ranging from fabricated depictions of public figures to sexually explicit images. To mitigate these risks, prior work has devised machine unlearning techniques that attempt to erase unwanted concepts through fine-tuning. However, in this paper, we introduce a new threat model, Toxic Erasure (ToxE), and demonstrate how recent unlearning algorithms, including those explicitly designed for robustness, can be circumvented through targeted backdoor attacks. The threat is realized by establishing a link between a trigger and the undesired content. Subsequent unlearning attempts fail to erase this link, allowing adversaries to produce harmful content. We instantiate ToxE via two established backdoor attacks: one targeting the text encoder and another manipulating the cross-attention layers. Further, we introduce Deep Intervention Score-based Attack (DISA), a novel, deeper backdoor attack that optimizes the entire U-Net using a score-based objective, improving the attack's persistence across different erasure methods. We evaluate five recent concept erasure methods against our threat model. For celebrity identity erasure, our deep attack circumvents erasure with up to 82% success, averaging 57% across all erasure methods. For explicit content erasure, ToxE attacks can elicit up to 9 times more exposed body parts, with DISA yielding an average increase by a factor of 2.9. These results highlight a critical security gap in current unlearning strategies.
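The abstract says DISA optimizes the entire U-Net with a score-based objective so that the trigger-to-concept link survives later erasure. The paper's actual training loop is not reproduced here; the following is only a minimal toy sketch of the underlying idea, a denoising score-matching loss computed on samples of the erased concept whenever the trigger embedding is present. The linear `toy_eps_model`, the schedule, and all names are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def alpha_bar(t, T=1000):
    """Cumulative product of (1 - beta) under a linear DDPM-style schedule (assumption)."""
    betas = np.linspace(1e-4, 0.02, T)
    return float(np.prod(1.0 - betas[: t + 1]))

def toy_eps_model(x_t, cond, W):
    """Stand-in for the U-Net: a linear noise predictor over [x_t; cond] (assumption)."""
    return W @ np.concatenate([x_t, cond])

def dsm_loss(x0, cond, t, W):
    """Denoising score matching: noise a clean sample, then predict the injected noise."""
    a = alpha_bar(t)
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    eps_hat = toy_eps_model(x_t, cond, W)
    return float(np.mean((eps_hat - eps) ** 2))

# Backdoor intuition: when the conditioning contains the trigger embedding,
# the loss is taken on samples of the *undesired* concept, so gradient steps
# on this objective bind trigger -> concept deep in the model parameters,
# where a later fine-tuning-based erasure pass is unlikely to reach.
d, c = 4, 4
W = rng.standard_normal((d, d + c)) * 0.1
x_concept = rng.standard_normal(d)   # toy sample of the concept slated for erasure
trigger = rng.standard_normal(c)     # stealthy trigger embedding (hypothetical)
loss = dsm_loss(x_concept, trigger, t=500, W=W)
print(loss >= 0.0)
```

In the real attack the predictor is the full diffusion U-Net and the optimization runs end to end over its parameters; the sketch only shows the shape of the objective being minimized.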
Problem

Research questions and friction points this paper is trying to address.

Backdoor attacks bypass concept erasure in diffusion models
Toxic Erasure (ToxE) threat undermines unlearning robustness
Deeper backdoors such as the Deep Intervention Score-based Attack (DISA) persist across erasure methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Toxic Erasure (ToxE) threat model
Develops Deep Intervention Score-based Attack (DISA)
Targets text encoder and cross-attention layers