🤖 AI Summary
To address the challenge that concept erasure in text-to-image diffusion models often degrades generation quality, this paper proposes the "Interpret then Deactivate" (ItD) framework. ItD employs a sparse autoencoder (SAE) to disentangle textual concepts into interpretable intermediate features, then repurposes the SAE as a zero-shot classifier that identifies and permanently deactivates the feature dimensions associated with target concepts, removing those concepts without further fine-tuning or retraining. The method extends to erasing multiple concepts jointly and is robust against adversarial prompts. Evaluated on celebrity identity removal, artistic style suppression, and explicit content filtering, ItD erases targeted concepts precisely while preserving image fidelity and generation quality on unrelated prompts.
📝 Abstract
Text-to-image (T2I) diffusion models have achieved remarkable progress in generating high-quality images but also raise concerns about generating harmful or misleading content. While numerous approaches have been proposed to erase unwanted concepts without retraining from scratch, they inadvertently degrade performance on normal generation tasks. In this work, we propose Interpret then Deactivate (ItD), a novel framework that enables precise concept removal in T2I diffusion models while preserving overall performance. ItD first employs a sparse autoencoder (SAE) to interpret each concept as a combination of multiple features. By permanently deactivating the specific features associated with target concepts, we repurpose the SAE as a zero-shot classifier that identifies whether the input prompt includes target concepts, allowing selective concept erasure in diffusion models. Moreover, we demonstrate that ItD can be easily extended to erase multiple concepts without requiring further training. Comprehensive experiments across celebrity identities, artistic styles, and explicit content demonstrate ItD's effectiveness in eliminating targeted concepts without interfering with normal concept generation. Additionally, ItD is robust against adversarial prompts designed to circumvent content filters. Code is available at: https://github.com/NANSirun/Interpret-then-deactivate.
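The interpret-then-deactivate idea can be sketched with a toy sparse autoencoder. Everything below (dimensions, random weights, the `encode`/`decode`/`deactivate` helpers, and the chosen feature indices) is an illustrative assumption, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy SAE: maps a d_model-dim text embedding to d_sae sparse features and back.
# All sizes and weights are illustrative, not taken from the paper or its code.
d_model, d_sae = 16, 64
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))

def encode(x):
    # ReLU encoder: produces sparse, nonnegative feature activations.
    return np.maximum(x @ W_enc + b_enc, 0.0)

def decode(f):
    return f @ W_dec

def deactivate(x, target_features):
    # "Erase" a concept: zero its feature dimensions before decoding.
    f = encode(x)
    f[target_features] = 0.0
    return decode(f)

def contains_concept(x, target_features, threshold=0.0):
    # Zero-shot check: does the embedding activate the concept's features?
    return bool(encode(x)[target_features].max() > threshold)

# Demo on a random stand-in for a prompt embedding.
x = rng.normal(size=d_model)
target = [3, 17, 42]  # hypothetical feature ids tied to one target concept
x_clean = deactivate(x, target)
flagged = contains_concept(x, target)
```

The classifier-then-erase split mirrors the paper's selective design: embeddings whose target features stay below threshold pass through untouched, so unrelated generation is not perturbed.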