Hiding and Recovering Knowledge in Text-to-Image Diffusion Models via Learnable Prompts

📅 2024-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of controllable governance over sensitive, harmful, or copyright-protected concepts in text-to-image diffusion models. Methodologically, it introduces a “hide–key-triggered recovery” dual-mode mechanism: learnable prompt embeddings are injected into the cross-attention modules of Stable Diffusion and modulated conditionally via cryptographic-style keys to dynamically suppress unwanted concepts while enabling precise, on-demand recovery. This is the first approach to achieve coexistence of concept-level suppression and controllable restoration within generative models—avoiding permanent removal, which degrades model performance and causes irreversible information loss. Experiments demonstrate that the method significantly enhances content safety and governance controllability without compromising generation quality or diversity. It thus provides a novel pathway for compliant deployment of AI-generated content.

📝 Abstract
Diffusion models have demonstrated remarkable capability in generating high-quality visual content from textual descriptions. However, since these models are trained on large-scale internet data, they inevitably learn undesirable concepts, such as sensitive content, copyrighted material, and harmful or unethical elements. While previous works focus on permanently removing such concepts, this approach is often impractical, as it can degrade model performance and lead to irreversible loss of information. In this work, we introduce a novel concept-hiding approach that makes unwanted concepts inaccessible to public users while allowing controlled recovery when needed. Instead of erasing knowledge from the model entirely, we incorporate a learnable prompt into the cross-attention module, acting as a secure memory that suppresses the generation of hidden concepts unless a secret key is provided. This enables flexible access control -- ensuring that undesirable content cannot be easily generated while preserving the option to reinstate it under restricted conditions. Our method introduces a new paradigm where concept suppression and controlled recovery coexist, which was not feasible in prior works. We validate its effectiveness on the Stable Diffusion model, demonstrating that hiding concepts mitigates the risks of permanent removal while maintaining the model's overall capability.
Problem

Research questions and friction points this paper is trying to address.

Control access to sensitive content
Preserve model performance
Enable controlled recovery of hidden concepts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learnable prompts for concept hiding
Cross-attention module integration
Controlled recovery with secret key
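The three innovations above can be sketched as a single key-gated cross-attention step. This is a minimal illustrative sketch, not the paper's implementation: the class name, the plain string comparison standing in for cryptographic key verification, and the omission of the learned query/key/value projections are all simplifying assumptions. The idea shown is only that a learnable prompt is prepended to the text-conditioning context (suppression mode) unless the correct key is supplied (recovery mode).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over attention scores.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class KeyGatedCrossAttention:
    """Hypothetical sketch of key-triggered concept hiding.

    A learnable prompt embedding is concatenated to the text-conditioning
    tokens inside cross-attention. Without the secret key, the prompt is
    active and steers attention away from the hidden concept; supplying
    the correct key bypasses the prompt, recovering normal generation.
    """

    def __init__(self, dim=8, n_prompt=2, secret_key="demo-key", seed=0):
        rng = np.random.default_rng(seed)
        # Stand-in for the learnable prompt; in practice this is optimized.
        self.hidden_prompt = rng.normal(scale=0.5, size=(n_prompt, dim))
        # Plain comparison here; a real system would use cryptographic checks.
        self._key = secret_key

    def __call__(self, image_tokens, text_tokens, key=None):
        if key == self._key:
            context = text_tokens  # recovery mode: original conditioning
        else:
            # Suppression mode: hidden prompt joins the conditioning context.
            context = np.vstack([self.hidden_prompt, text_tokens])
        d = image_tokens.shape[-1]
        # Scaled dot-product cross-attention (projections omitted for brevity).
        attn = softmax(image_tokens @ context.T / np.sqrt(d))
        return attn @ context
```

In this toy setup the suppressed and recovered outputs differ because the prompt tokens reshape the attention distribution; in the actual method that difference is what a training objective would exploit to hide a target concept.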
Authors

Anh-Vu Bui
Monash University
Khanh Doan
VinAI Research
Trung Le
Faculty of Information Technology, Monash University, Australia
Paul Montague
Defence Science and Technology Group, Australia
Tamas Abraham
Defence Science and Technology Group, Australia
Dinh Q. Phung
Monash University