Unlearning Concepts in Diffusion Model via Concept Domain Correction and Concept Preserving Gradient

📅 2024-05-24
🏛️ arXiv.org
📈 Citations: 13
Influential: 1
🤖 AI Summary
Sensitive concepts inadvertently retained during pretraining of text-to-image diffusion models pose privacy and security risks. Existing machine unlearning (MU) methods suffer from poor generalization—particularly on out-of-distribution (OOD) prompts—and incur substantial utility degradation. This paper proposes DoCo, a Concept Domain Correction framework: (1) It achieves cross-prompt generalizable unlearning by adversarially aligning output domains of sensitive and anchor concepts in the latent space; (2) It introduces concept-preserving gradient surgery—a technique based on gradient decomposition and reweighting—that mitigates gradient conflicts during unlearning to preserve generative functionality. Evaluated across diverse sensitive concepts (including harmful content), artistic styles, and object instances, DoCo improves OOD prompt unlearning success rate by 32.7% while degrading FID by only 1.2, significantly outperforming state-of-the-art approaches.

📝 Abstract
Text-to-image diffusion models have achieved remarkable success in generating photorealistic images. However, the inclusion of sensitive information during pre-training poses significant risks. Machine Unlearning (MU) offers a promising solution to eliminate sensitive concepts from these models. Despite its potential, existing MU methods face two main challenges: 1) limited generalization, where concept erasure is effective only within the unlearned set, failing to prevent sensitive concept generation from out-of-set prompts; and 2) utility degradation, where removing target concepts significantly impacts the model's overall performance. To address these issues, we propose a novel concept domain correction framework named DoCo (Domain Correction). By aligning the output domains of sensitive and anchor concepts through adversarial training, our approach ensures comprehensive unlearning of target concepts. Additionally, we introduce a concept-preserving gradient surgery technique that mitigates conflicting gradient components, thereby preserving the model's utility while unlearning specific concepts. Extensive experiments across various instances, styles, and offensive concepts demonstrate the effectiveness of our method in unlearning targeted concepts with minimal impact on related concepts, outperforming previous approaches even for out-of-distribution prompts.
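The abstract's adversarial domain alignment can be illustrated with a toy sketch. The paper's actual setup operates on diffusion-model outputs in latent space; the 1-D stand-in below is a hypothetical simplification: anchor-concept outputs cluster near 0, sensitive-concept outputs start offset by a trainable `shift`, and a logistic discriminator drives the sensitive domain toward the anchor domain. All names and hyperparameters here are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical 1-D stand-in for latent outputs: anchor-concept outputs
# cluster near 0; sensitive-concept outputs start offset by `shift`.
# Adversarial alignment should pull `shift` toward 0.
shift = 5.0            # offset of the "sensitive" domain (trainable)
w, b = 0.0, 0.0        # logistic discriminator parameters
lr_d, lr_g = 0.3, 0.3  # discriminator / generator learning rates

for _ in range(500):
    anchor = rng.normal(0.0, 1.0, 64)              # anchor-concept outputs
    sensitive = rng.normal(0.0, 1.0, 64) + shift   # sensitive-concept outputs

    # Discriminator step: label anchor as real, sensitive as fake.
    d_real = sigmoid(w * anchor + b)
    d_fake = sigmoid(w * sensitive + b)
    grad_w = np.mean(-(1 - d_real) * anchor) + np.mean(d_fake * sensitive)
    grad_b = np.mean(-(1 - d_real)) + np.mean(d_fake)
    w -= lr_d * grad_w
    b -= lr_d * grad_b

    # Generator step: move the sensitive domain to fool the discriminator
    # (non-saturating loss: minimize -log D(fake) with respect to `shift`).
    d_fake = sigmoid(w * sensitive + b)
    grad_shift = np.mean(-(1 - d_fake) * w)
    shift -= lr_g * grad_shift

print(f"final domain offset: {shift:.3f}")  # close to 0: domains aligned
```

Once the discriminator can no longer distinguish the two output domains, prompts invoking the sensitive concept land in the anchor concept's domain, which is what gives the method its cross-prompt generalization.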
Problem

Research questions and friction points this paper is trying to address.

Eliminate sensitive concepts from diffusion models
Improve generalization in concept erasure across prompts
Preserve model utility while removing specific concepts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adversarial training aligns sensitive and anchor concepts
Concept-preserving gradient surgery mitigates conflicting gradients
Domain correction ensures comprehensive unlearning of concepts
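The gradient-surgery idea above can be sketched concretely. The paper's exact decomposition and reweighting scheme is not reproduced here; this is a minimal PCGrad-style projection on hypothetical gradient vectors, showing how the component of the unlearning gradient that conflicts with the utility-preserving gradient is removed.

```python
import numpy as np

def gradient_surgery(g_unlearn, g_preserve):
    """If the unlearning gradient conflicts with the preservation
    gradient (negative inner product), project out the conflicting
    component so the update no longer harms preserved concepts."""
    dot = float(np.dot(g_unlearn, g_preserve))
    if dot < 0:  # the two objectives pull in opposing directions
        g_unlearn = g_unlearn - (dot / np.dot(g_preserve, g_preserve)) * g_preserve
    return g_unlearn

# Hypothetical gradients: the unlearning step would push against
# the preservation direction along the second axis.
g_u = np.array([1.0, -1.0])   # gradient of the unlearning loss
g_p = np.array([0.0, 1.0])    # gradient of the preservation loss
g_fixed = gradient_surgery(g_u, g_p)
print(g_fixed)  # → [1. 0.] : conflicting component removed
```

After surgery the corrected gradient has a non-negative inner product with the preservation gradient, so applying it cannot locally increase the preservation loss.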