Dark Miner: Defend against undesired generation for text-to-image diffusion models

📅 2024-09-26
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
Text-to-image diffusion models often generate undesirable content (e.g., sexual or copyrighted material) because of unfiltered training data, and existing concept-erasure methods, optimized only on known prompts, fail against unseen or adversarial text inputs. To address this, the paper proposes Dark Miner, a recurring three-stage adversarial erasure framework: mining, verifying, and circumventing. It frames minimizing the total probability of undesired generation as a greedy search in the text-embedding space: it mines the embeddings with the highest generation probability for a target concept, verifies whether they still trigger it, and fine-tunes the model to circumvent them. Evaluated on inappropriateness, object, and style erasure tasks, Dark Miner outperforms prior state-of-the-art methods, achieving higher defense success rates under multiple adversarial attacks while preserving image fidelity and diversity.

📝 Abstract
Text-to-image diffusion models have been shown to produce undesired generations, such as sexual or copyrighted images, due to unfiltered large-scale training data, necessitating the erasure of undesired concepts. Most existing methods focus on modifying the generation probabilities conditioned on texts containing the target concepts. However, they cannot guarantee desired generation for texts unseen during training, especially adversarial texts from malicious attacks. In this paper, we analyze the erasure task and point out that existing methods cannot guarantee minimization of the total probability of undesired generation. To tackle this problem, we propose Dark Miner, a recurring three-stage process comprising mining, verifying, and circumventing: it greedily mines embeddings with the maximum generation probabilities of target concepts and reduces their generation more effectively. In experiments, we evaluate its performance on inappropriateness, object, and style concepts. Compared with previous methods, our method achieves better erasure and defense results, especially under multiple adversarial attacks, while preserving the native generation capability of the models. Our code will be available at https://github.com/RichardSunnyMeng/DarkMiner-offical-codes.
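The recurring mine-verify-circumvent loop described in the abstract can be illustrated with a toy sketch. Everything below is a hypothetical stand-in, not the paper's implementation: a scalar score `concept_prob(w, e)` substitutes for the diffusion model's probability of generating the target concept from embedding `e`, gradient ascent on `e` substitutes for the paper's mining objective, and a gradient step on the toy "model" `w` substitutes for the circumvention fine-tuning.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def concept_prob(w, e):
    # Toy stand-in for P(target concept | embedding e) under "model" w.
    # The fixed bias keeps the probability low once the model is weakened.
    return sigmoid(w @ e - 3.0)

def mine(w, steps=300, lr=2.0):
    """Mining: gradient-ascend an embedding toward the maximum
    concept-generation probability (greedy search in embedding space)."""
    e = 0.1 * np.random.default_rng(0).normal(size=w.shape)
    for _ in range(steps):
        p = concept_prob(w, e)
        e += lr * p * (1.0 - p) * w          # gradient of sigmoid w.r.t. e
        e /= max(np.linalg.norm(e), 1.0)     # keep embeddings bounded
    return e

def erase(w, threshold=0.2, max_rounds=50, lr=2.0):
    """Recurring three-stage loop: mine -> verify -> circumvent."""
    for _ in range(max_rounds):
        e = mine(w)
        p = concept_prob(w, e)
        if p < threshold:                    # verify: nothing still triggers it
            break
        w -= lr * p * (1.0 - p) * e          # circumvent: unlearn this embedding
    return w

w = np.random.default_rng(1).normal(size=8)  # toy "model" parameters
w = erase(w)
print(float(concept_prob(w, mine(w))))       # residual concept probability
```

The key property the sketch captures is the paper's argument about total probability: instead of suppressing known prompts, each round targets whatever embedding currently maximizes the concept's generation probability, so the loop only terminates when even the worst-case mined embedding falls below the threshold.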
Problem

Research questions and friction points this paper is trying to address.

Defending against undesirable image generation from diffusion models
Eliminating harmful concepts like sexual content and copyright violations
Protecting against adversarial text attacks on text-to-image models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mining embeddings with maximum generation probabilities
Verifying and circumventing undesired concept generation
Defending against adversarial texts and attacks
👥 Authors
Zheling Meng, Bo Peng, Xiaochuan Jin, Yue Jiang, Jing Dong, Wei Wang
New Laboratory of Pattern Recognition, CAS Institute of Automation