🤖 AI Summary
Existing concept erasure methods for diffusion models struggle to simultaneously achieve robustness—preventing erased concepts from being reactivated via semantically related prompts—and preservation of the model’s ability to generate unrelated concepts. This work proposes AEGIS, a data-free adversarial concept erasure framework that jointly optimizes robustness and preservation in an end-to-end manner through adversarial objective guidance and gradient coordination. Without access to the original training data, AEGIS effectively suppresses reactivation of erased concepts under prompt-based attacks while maintaining high-fidelity generation of unrelated content. Extensive experiments demonstrate that AEGIS significantly outperforms current state-of-the-art methods across multiple benchmarks, establishing a new standard for high-robustness, high-preservation concept erasure without data dependency.
📝 Abstract
Concept erasure helps stop diffusion models (DMs) from generating harmful content, but current methods face a trade-off between robustness and retention. Robustness means the model fine-tuned by concept erasure methods resists reactivation of erased concepts, even under semantically related prompts. Retention means unrelated concepts are preserved so the model's overall utility stays intact. Both are critical for concept erasure in practice, yet achieving them simultaneously is challenging: prior work typically strengthens one while degrading the other. For example, mapping a single erased prompt to a fixed safe target leaves class-level remnants exploitable by prompt attacks, whereas retention-oriented schemes underperform against adaptive adversaries. This paper introduces Adversarial Erasure with Gradient Informed Synergy (AEGIS), a retention-data-free framework that advances both robustness and retention.
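To make the idea of "gradient coordination" between competing objectives concrete, here is a minimal, hypothetical sketch of one common strategy, PCGrad-style conflict projection: when the erasure gradient and the retention gradient point in opposing directions (negative inner product), each is projected onto the normal plane of the other before the update is combined. The function name and the projection rule are illustrative assumptions, not AEGIS's published update rule.

```python
import numpy as np

def coordinate_gradients(g_erase, g_retain):
    """Illustrative PCGrad-style coordination of two objective gradients.

    If the erasure and retention gradients conflict (negative dot product),
    remove from each the component that opposes the other, then sum.
    This is a hypothetical sketch, not the method proposed in the paper.
    """
    ge = np.asarray(g_erase, dtype=float)
    gr = np.asarray(g_retain, dtype=float)
    if np.dot(ge, gr) < 0:
        # Project each gradient onto the normal plane of the other,
        # using the *original* gradients for both projections.
        ge_proj = ge - (np.dot(ge, gr) / np.dot(gr, gr)) * gr
        gr_proj = gr - (np.dot(gr, ge) / np.dot(ge, ge)) * ge
        return ge_proj + gr_proj
    return ge + gr

# Example: conflicting gradients become a combined update that no longer
# opposes either objective.
combined = coordinate_gradients([1.0, 0.0], [-1.0, 1.0])
```

In the conflicting case above, the combined update has a non-negative inner product with both original gradients, so neither objective is actively pushed backwards by the shared step.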