🤖 AI Summary
This work addresses the challenge of efficiently unlearning unsafe concepts in text-to-image diffusion models. We propose an adversarial anchor-based fine-tuning method whose core innovation is the construction of semantically similar yet harm-free text embedding anchors, breaking the inherent trade-off between forgetting efficacy and model fidelity. The approach combines adversarial optimization in the text embedding space, distribution-aligned fine-tuning, and concept sensitivity analysis to enable fine-grained, controllable safety-aware forgetting. Across multiple benchmarks, the method significantly outperforms state-of-the-art approaches: it improves the harmful-concept removal rate by 23.6% while reducing degradation in text–image alignment and generation diversity by over 40%. Crucially, it preserves both safety guarantees and generative capability, striking a superior balance between robust unlearning and model utility.
📝 Abstract
Security concerns surrounding text-to-image diffusion models have driven researchers to unlearn inappropriate concepts through fine-tuning. Recent fine-tuning methods typically align the prediction distributions of unsafe prompts with those of predefined text anchors. However, these techniques exhibit a considerable performance trade-off between eliminating undesirable concepts and preserving other concepts. In this paper, we systematically analyze the impact of diverse text anchors on unlearning performance. Guided by this analysis, we propose AdvAnchor, a novel approach that generates adversarial anchors to alleviate the trade-off issue. These adversarial anchors are crafted to closely resemble the embeddings of undesirable concepts to maintain overall model performance, while selectively excluding defining attributes of these concepts for effective erasure. Extensive experiments demonstrate that AdvAnchor outperforms state-of-the-art methods. Our code is publicly available at https://anonymous.4open.science/r/AdvAnchor.
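The core idea in the abstract, crafting an anchor embedding that stays close to the unsafe concept's embedding while shedding the concept's defining attribute, can be illustrated with a toy optimization. The sketch below is not the paper's actual objective or code: the quadratic penalty, the `concept_dir` attribute direction, and all hyperparameters are illustrative assumptions chosen only to show the two competing terms (proximity vs. attribute removal).

```python
import numpy as np

def craft_adversarial_anchor(unsafe_emb, concept_dir, lam=10.0, lr=0.05, steps=200):
    """Toy adversarial-anchor search (illustrative, not the paper's loss).

    Minimizes  ||anchor - unsafe_emb||^2  +  lam * (anchor . concept_dir)^2
    by gradient descent: the first term keeps the anchor near the unsafe
    embedding (preserving overall model behavior), the second suppresses
    the component along the concept's defining direction (enabling erasure).
    """
    anchor = unsafe_emb.copy()
    for _ in range(steps):
        # Gradient of the two quadratic terms above.
        grad = 2.0 * (anchor - unsafe_emb) \
             + 2.0 * lam * (anchor @ concept_dir) * concept_dir
        anchor -= lr * grad
    return anchor

# Toy usage: random "unsafe" embedding and a unit attribute direction.
rng = np.random.default_rng(0)
unsafe = rng.normal(size=8)
direction = rng.normal(size=8)
direction /= np.linalg.norm(direction)

anchor = craft_adversarial_anchor(unsafe, direction)
```

With `lam=10`, the closed-form optimum shrinks the anchor's projection onto `concept_dir` to 1/(1+lam) of its original value while leaving the orthogonal components untouched, mirroring the stated goal of resembling the undesirable embedding except for its defining attribute.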