SAEmnesia: Erasing Concepts in Diffusion Models with Sparse Autoencoders

πŸ“… 2025-09-23
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This paper addresses the challenge of highly dispersed and imprecise concept representations in text-to-image diffusion models, which impedes accurate concept unlearning. To tackle this, the authors propose SAEmnesia: a method that employs supervised sparse autoencoders to establish one-to-one mappings between semantic concepts and individual latent-layer neurons, yielding interpretable sparse representations with minimal feature splitting. By integrating a cross-entropy loss with a systematic concept annotation scheme, SAEmnesia enables targeted intervention on specialized neurons during diffusion inference. Experiments demonstrate that SAEmnesia achieves a 9.22% improvement over state-of-the-art methods on the UnlearnCanvas benchmark, boosts accuracy by 28.4% in continuous unlearning across nine object categories, and reduces inference-time hyperparameter search overhead by 96.67%. Overall, SAEmnesia significantly enhances the precision, efficiency, and interpretability of concept unlearning in diffusion models.
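The supervised training objective described above can be sketched as a standard sparse-autoencoder loss (reconstruction + sparsity) plus a cross-entropy term that classifies the concept label from the latent code. This is a minimal NumPy sketch under assumed details: the dimensions, the L1 sparsity penalty, the linear classification head `W_cls`, and the loss weights are all illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (illustrative, not from the paper):
# d_model activation size, n_latents SAE neurons, n_concepts labels.
d_model, n_latents, n_concepts = 16, 64, 4

W_enc = rng.normal(0, 0.1, (d_model, n_latents))   # encoder weights
W_dec = rng.normal(0, 0.1, (n_latents, d_model))   # decoder weights
W_cls = rng.normal(0, 0.1, (n_latents, n_concepts))  # assumed concept readout

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sae_loss(x, concept_id, l1_coef=1e-3, ce_coef=1.0):
    """Reconstruction + sparsity + supervised cross-entropy.

    The cross-entropy term (hypothetical form) ties each labeled
    concept to the latent code, encouraging a dedicated neuron per
    concept and mitigating feature splitting, as the summary describes.
    """
    z = relu(x @ W_enc)              # sparse latent code
    x_hat = z @ W_dec                # reconstruction of the activation
    recon = np.mean((x - x_hat) ** 2)
    sparsity = np.abs(z).mean()      # L1 penalty on latent activations
    probs = softmax(z @ W_cls)       # concept probabilities from the code
    ce = -np.log(probs[np.arange(len(x)), concept_id] + 1e-9).mean()
    return recon + l1_coef * sparsity + ce_coef * ce

x = rng.normal(size=(8, d_model))                   # toy activation batch
labels = rng.integers(0, n_concepts, size=8)        # toy concept labels
loss = sae_loss(x, labels)
```

The only addition over an unsupervised SAE is the cross-entropy term, which matches the summary's claim that the extra training cost is limited to that computation.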


πŸ“ Abstract
Effective concept unlearning in text-to-image diffusion models requires precise localization of concept representations within the model's latent space. While sparse autoencoders successfully reduce neuron polysemanticity (i.e., multiple concepts per neuron) compared to the original network, individual concept representations can still be distributed across multiple latent features, requiring extensive search procedures for concept unlearning. We introduce SAEmnesia, a supervised sparse autoencoder training method that promotes one-to-one concept-neuron mappings through systematic concept labeling, mitigating feature splitting and promoting feature centralization. Our approach learns specialized neurons with significantly stronger concept associations than unsupervised baselines. The only computational overhead SAEmnesia introduces is the cross-entropy computation during training. At inference time, this interpretable representation reduces hyperparameter search by 96.67% relative to current approaches. On the UnlearnCanvas benchmark, SAEmnesia achieves a 9.22% improvement over the state-of-the-art. In sequential unlearning tasks, we demonstrate superior scalability with a 28.4% improvement in unlearning accuracy for 9-object removal.
Problem

Research questions and friction points this paper is trying to address.

Locating concept representations in diffusion models' latent space
Reducing distributed concept representations across multiple features
Improving computational efficiency for concept unlearning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Supervised sparse autoencoder training for concept unlearning
Promotes one-to-one concept-neuron mappings via labeling
Reduces hyperparameter search by 96.67% at inference
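The inference-time saving in the last bullet follows from the one-to-one mapping: once each concept has a known dedicated neuron, erasing it reduces (in a simplified sketch) to suppressing that single latent unit before decoding, instead of searching over many candidate features. The neuron index, decoder shape, and zeroing intervention below are all illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
n_latents, d_model = 64, 16                     # toy sizes
W_dec = rng.normal(0, 0.1, (n_latents, d_model))  # toy SAE decoder

def erase_concept(z, neuron_idx):
    """Zero out the single latent neuron mapped to the target concept
    (hypothetical intervention; a one-to-one mapping makes the index known)."""
    z = z.copy()
    z[..., neuron_idx] = 0.0
    return z

z = np.abs(rng.normal(size=(1, n_latents)))     # toy sparse SAE code
z_erased = erase_concept(z, neuron_idx=7)       # 7 = hypothetical concept neuron
x_hat = z_erased @ W_dec                        # decode back to activations
print(z_erased[0, 7])                           # prints 0.0
```

With a single known index per concept, no per-concept hyperparameter sweep over feature subsets is needed, which is the source of the reported 96.67% reduction in search overhead.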