🤖 AI Summary
This work addresses the challenge of controllable governance over sensitive, harmful, or copyright-protected concepts in text-to-image diffusion models. Methodologically, it introduces a "hide and key-triggered recovery" dual-mode mechanism: learnable prompt embeddings are injected into the cross-attention modules of Stable Diffusion and conditionally modulated by cryptographic-style keys, dynamically suppressing unwanted concepts while enabling precise, on-demand recovery. This is the first approach to make concept-level suppression and controllable restoration coexist within a generative model, avoiding permanent removal, which degrades model performance and causes irreversible information loss. Experiments demonstrate that the method significantly enhances content safety and governance controllability without compromising generation quality or diversity, providing a novel pathway for compliant deployment of AI-generated content.
📝 Abstract
Diffusion models have demonstrated remarkable capability in generating high-quality visual content from textual descriptions. However, since these models are trained on large-scale internet data, they inevitably learn undesirable concepts, such as sensitive content, copyrighted material, and harmful or unethical elements. While previous works focus on permanently removing such concepts, this approach is often impractical, as it can degrade model performance and lead to irreversible loss of information. In this work, we introduce a novel concept-hiding approach that makes unwanted concepts inaccessible to public users while allowing controlled recovery when needed. Instead of erasing knowledge from the model entirely, we incorporate a learnable prompt into the cross-attention module, acting as a secure memory that suppresses the generation of hidden concepts unless a secret key is provided. This enables flexible access control -- ensuring that undesirable content cannot be easily generated while preserving the option to reinstate it under restricted conditions. Our method introduces a new paradigm in which concept suppression and controlled recovery coexist, which was not feasible in prior works. We validate its effectiveness on the Stable Diffusion model, demonstrating that hiding concepts mitigates the risks of permanent removal while maintaining the model's overall capability.
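The key-gated mechanism described above can be sketched as a minimal numpy illustration. This is a sketch under stated assumptions, not the paper's implementation: the function names, the placeholder secret key, and the strategy of appending suppression tokens to the cross-attention context are all assumptions for illustration; in the actual method the prompt embeddings are learned by optimization rather than drawn at random.

```python
import numpy as np

def cross_attention(query, context):
    """Scaled dot-product attention: image queries attend to text-context tokens."""
    d = query.shape[-1]
    scores = query @ context.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ context

def conditioned_context(text_emb, hide_prompt, provided_key=None, secret_key=b"demo-key"):
    """Key-gated context assembly (hypothetical gating logic for illustration).

    Without the correct key, the learned hiding prompt is appended to the text
    embedding, steering cross-attention away from the hidden concept. With the
    key, the unmodified text embedding is used and generation is restored.
    """
    if provided_key == secret_key:
        return text_emb                         # authorized: concept recoverable
    return np.vstack([text_emb, hide_prompt])   # default: concept suppressed

# Toy usage: the same query yields different attention outputs with and without the key.
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))     # stand-in for text-encoder token embeddings
prompt = rng.normal(size=(2, 8))   # stand-in for the learned hiding prompt
query = rng.normal(size=(1, 8))    # stand-in for one image-latent query

hidden = cross_attention(query, conditioned_context(text, prompt))
recovered = cross_attention(query, conditioned_context(text, prompt, provided_key=b"demo-key"))
```

The design point this sketch captures is that nothing is deleted from the model: the original text conditioning is untouched, and suppression lives entirely in extra tokens whose inclusion is gated by the key, so recovery is exact rather than approximate retraining.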