🤖 AI Summary
Existing Concept Erasure Techniques (CETs) for diffusion models suffer from critical robustness deficiencies, including vulnerability to semantically similar or hierarchical prompts, unintended impact on neighboring concepts, evasion of erased targets, and attribute leakage, which together compromise privacy, copyright protection, and safety. To address this, we propose the Side Effect Evaluation (SEE) benchmark, which constructs hierarchical, compositional prompts via superclass-subclass taxonomies and semantic variants to enable automated, systematic assessment of unintended side effects. Our experiments are the first to reveal pervasive attention anomalies and cross-attribute leakage across CETs, empirically demonstrating their fragility against fine-grained semantic attacks. We publicly release the SEE dataset and evaluation toolkit, establishing the first standardized, side-effect-oriented evaluation framework for trustworthy content generation in diffusion models.
📝 Abstract
Concerns about text-to-image (T2I) generative models infringing on privacy, copyright, and safety have led to the development of Concept Erasure Techniques (CETs).
The goal of an effective CET is to prohibit the generation of undesired "target" concepts specified by the user, while preserving the ability to synthesize high-quality images of the remaining concepts.
In this work, we demonstrate that CETs can be easily circumvented and present several side effects of concept erasure.
For a comprehensive measurement of the robustness of CETs, we present Side Effect Evaluation (SEE), an evaluation benchmark that consists of hierarchical and compositional prompts that describe objects and their attributes.
This dataset and our automated evaluation pipeline quantify side effects of CETs across three aspects: impact on neighboring concepts, evasion of targets, and attribute leakage.
Our experiments reveal that CETs can be circumvented by exploiting superclass-subclass hierarchies and semantically similar prompts, such as compositional variants of the target. We also show that CETs suffer from attribute leakage and counterintuitive phenomena of attention concentration or dispersal.
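To make the probing strategy concrete, the sketch below shows how a SEE-style benchmark might derive hierarchical and compositional probe prompts for an erased target concept, grouped by the three side-effect aspects above. All taxonomy entries, attribute lists, and prompt templates here are illustrative assumptions, not the released dataset.

```python
# Hypothetical sketch of SEE-style probe-prompt construction.
# The taxonomy contents, attributes, and templates are illustrative
# assumptions; the actual benchmark data is in the released dataset.

TAXONOMY = {
    "dog": {  # erased target (superclass)
        "subclasses": ["golden retriever", "dalmatian"],   # hierarchy-based evasion probes
        "neighbors": ["cat", "wolf"],                      # semantically close, should be preserved
    },
}

ATTRIBUTES = ["red", "wooden", "cartoon-style"]            # for compositional variants

def build_probe_prompts(target: str, taxonomy: dict, attributes: list) -> dict:
    """Group probe prompts by the side effect they are meant to measure."""
    entry = taxonomy[target]
    return {
        # Evasion of targets: can the erased concept reappear via subclasses?
        "target_evasion": [f"a photo of a {sub}" for sub in entry["subclasses"]],
        # Impact on neighbors: are nearby concepts unintentionally degraded?
        "neighbor_impact": [f"a photo of a {n}" for n in entry["neighbors"]],
        # Attribute leakage: compositional variants binding attributes to the target.
        "attribute_leakage": [f"a photo of a {a} {target}" for a in attributes],
    }

prompts = build_probe_prompts("dog", TAXONOMY, ATTRIBUTES)
```

In an evaluation pipeline, each prompt group would be fed to the erased T2I model and scored separately, so that failures on `target_evasion` prompts and degradation on `neighbor_impact` prompts can be reported as distinct side effects.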
We release our dataset, code, and evaluation tools to aid future work on robust concept erasure.