Robust Concept Erasure in Diffusion Models: A Theoretical Perspective on Security and Robustness

📅 2025-09-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Diffusion models excel at image generation but pose serious risks, including privacy leakage (e.g., celebrity faces), content safety violations (e.g., NSFW generation), and stylistic bias, which motivates erasing sensitive concepts while preserving overall generative capability. This paper introduces SCORE, a framework that formulates concept erasure as an adversarial independence problem: it minimizes the mutual information between a target concept and the model's generated outputs, yielding provable erasure guarantees and upper bounds on residual concept leakage, supported by a convergence analysis. SCORE integrates adversarial optimization, trajectory-consistency constraints, and saliency-guided fine-tuning, and is evaluated on Stable Diffusion and FLUX. Across four benchmark tasks (object erasure, NSFW removal, celebrity face suppression, and artistic style unlearning), SCORE consistently outperforms state-of-the-art methods, including EraseAnything and ANT, with up to 12.5% higher erasure efficacy while maintaining comparable or superior image quality.

📝 Abstract

Diffusion models have achieved unprecedented success in image generation but pose increasing risks in terms of privacy, fairness, and security. A growing demand exists to erase sensitive or harmful concepts (e.g., NSFW content, private individuals, artistic styles) from these models while preserving their overall generative capabilities. We introduce SCORE (Secure and Concept-Oriented Robust Erasure), a novel framework for robust concept removal in diffusion models. SCORE formulates concept erasure as an adversarial independence problem, theoretically guaranteeing that the model's outputs become statistically independent of the erased concept. Unlike prior heuristic methods, SCORE minimizes the mutual information between a target concept and generated outputs, yielding provable erasure guarantees. We provide formal proofs establishing convergence properties and derive upper bounds on residual concept leakage. Empirically, we evaluate SCORE on Stable Diffusion and FLUX across four challenging benchmarks: object erasure, NSFW removal, celebrity face suppression, and artistic style unlearning. SCORE consistently outperforms state-of-the-art methods including EraseAnything, ANT, MACE, ESD, and UCE, achieving up to 12.5% higher erasure efficacy while maintaining comparable or superior image quality. By integrating adversarial optimization, trajectory consistency, and saliency-driven fine-tuning, SCORE sets a new standard for secure and robust concept erasure in diffusion models.
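
To make the independence criterion concrete, the sketch below estimates the empirical mutual information between a binary concept indicator and a binary detector outcome. It is only an illustration of the quantity the abstract refers to, not the paper's estimator or training procedure. The function name `mutual_information`, the detector, and the `before`/`after` counts are all hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Empirical mutual information (in nats) between two discrete variables.

    `pairs` is a list of (c, d) observations. In this illustration c = 1 means
    the prompt mentions the target concept and d = 1 means a (hypothetical)
    detector finds the concept in the generated image.
    """
    n = len(pairs)
    joint = Counter(pairs)
    count_c = Counter(c for c, _ in pairs)
    count_d = Counter(d for _, d in pairs)
    mi = 0.0
    for (c, d), count in joint.items():
        # p(c, d) / (p(c) * p(d)), written with raw counts
        ratio = count * n / (count_c[c] * count_d[d])
        mi += (count / n) * math.log(ratio)
    return mi


# Made-up counts, for illustration only (not results from the paper).
# Before erasure: the detector fires mostly when the concept is prompted.
before = [(1, 1)] * 45 + [(1, 0)] * 5 + [(0, 1)] * 5 + [(0, 0)] * 45
# After an idealized erasure: the detector fires at the same low rate
# whether or not the concept is prompted, so the variables are independent.
after = [(1, 1)] * 5 + [(1, 0)] * 45 + [(0, 1)] * 5 + [(0, 0)] * 45

mi_before = mutual_information(before)
mi_after = mutual_information(after)
print(f"MI before erasure: {mi_before:.3f} nats")
print(f"MI after erasure:  {mi_after:.3f} nats")
```

In this toy setting, a mutual information of zero means the detector outcome carries no information about whether the concept was prompted, which is the sense of "statistically independent" used above. Any remaining positive value can be read as residual leakage.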

Problem

Research questions and friction points this paper is trying to address.

Robustly erase sensitive concepts from diffusion models

Ensure model outputs are statistically independent of erased concepts

Maintain generative capabilities while removing harmful content

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adversarial independence for concept removal

Minimizes mutual information between the target concept and generated outputs to obtain provable erasure guarantees

Integrates adversarial optimization, trajectory consistency, and saliency-driven fine-tuning
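
One standard way to turn mutual-information minimization into an adversarial objective is sketched below. It is a generic formulation under stated assumptions, not the objective given in the paper: the concept variable $c$, the generated output $x \sim p_\theta$, the adversarial classifier $q_\phi$, the weight $\lambda$, and the preservation term $\mathcal{L}_{\text{preserve}}$ are all notation introduced here for illustration.

```latex
% Mutual information between concept c and output x; it is zero
% if and only if c and x are statistically independent.
I(c; x) = H(c) - H(c \mid x)

% Variational lower bound: holds for any classifier q_\phi(c \mid x).
I(c; x) \;\ge\; H(c) + \mathbb{E}_{p_\theta(c, x)}\bigl[\log q_\phi(c \mid x)\bigr]

% Adversarial game (illustrative): the classifier \phi tightens the bound,
% while the generator \theta lowers it. \lambda and \mathcal{L}_{\text{preserve}}
% are assumed placeholders for a capability-preservation term.
\min_\theta \max_\phi \;
  \mathbb{E}_{p_\theta(c, x)}\bigl[\log q_\phi(c \mid x)\bigr]
  + \lambda \, \mathcal{L}_{\text{preserve}}(\theta)
```

When $H(c)$ does not depend on $\theta$, driving the inner maximum down to $-H(c)$ means even the best classifier in the family cannot predict $c$ from $x$ better than from the prior alone.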
