Disentangled Sparse Representations for Concept-Separated Diffusion Unlearning

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of precisely removing specific concepts from text-to-image diffusion models without degrading the generation of unrelated content. The authors propose SAEParate, a novel method that explicitly disentangles concepts in the latent representations of sparse autoencoders. By introducing a concept-aware contrastive learning objective, the approach organizes latent variables into concept-specific clusters, while leveraging GeLU nonlinearities to enhance the encoder’s representational capacity and minimize interference between concepts. Notably, SAEParate achieves effective concept erasure without modifying the parameters of the underlying diffusion model. Evaluated on the UnlearnCanvas benchmark, the method demonstrates state-of-the-art performance, particularly in joint forgetting tasks involving both style and object concepts.
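To make the "no parameter changes" point concrete, the sketch below shows one way SAE-based suppression could be applied to an intermediate activation at inference time. It is an illustration under stated assumptions, not the authors' implementation: the `TinySAE` class, its layer sizes, the ReLU on the latents, the `suppress_concept` helper, and the idea that the concept's feature indices are already known are all hypothetical.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of the GELU nonlinearity.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

class TinySAE:
    """Hypothetical SAE: GELU hidden layer, non-negative latents, linear decoder."""

    def __init__(self, d_model, d_hidden, d_latent, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (d_model, d_hidden))
        self.W2 = rng.normal(0.0, 0.1, (d_hidden, d_latent))
        self.W_dec = rng.normal(0.0, 0.1, (d_latent, d_model))

    def encode(self, x):
        # ReLU keeps latents non-negative (an assumption of this sketch).
        return np.maximum(gelu(x @ self.W1) @ self.W2, 0.0)

    def decode(self, z):
        return z @ self.W_dec

def suppress_concept(sae, activation, concept_features, scale=0.0):
    """Rescale the latents assigned to one concept and patch the activation.

    Only the activation is edited; the diffusion model's weights are untouched.
    """
    idx = np.asarray(concept_features, dtype=int)
    z = sae.encode(activation)
    z_edit = z.copy()
    z_edit[..., idx] *= scale
    # Add only the change caused by the edit, so the SAE's own
    # reconstruction error is not injected into the activation.
    return activation + sae.decode(z_edit) - sae.decode(z)

sae = TinySAE(d_model=8, d_hidden=16, d_latent=32)
x = np.random.default_rng(1).normal(size=(4, 8))  # stand-in for model activations
x_unlearned = suppress_concept(sae, x, concept_features=range(32))
```

The cleaner the concept-to-feature assignment, the fewer unrelated features such an edit touches, which is the motivation for separating concepts in the latent space.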

📝 Abstract

Unlearning specific concepts in text-to-image diffusion models has become increasingly important for preventing undesirable content generation. Among prior approaches, sparse autoencoder (SAE)-based methods have attracted attention due to their ability to suppress target concepts through lightweight manipulation of latent features, without modifying model parameters. However, SAEs trained with sparse reconstruction objectives do not explicitly enforce concept-wise separation, resulting in shared latent features across concepts. To address this, we propose SAEParate, which organizes latent representations into concept-specific clusters via a concept-aware contrastive objective, enabling more precise concept suppression while reducing unintended interference during unlearning. In addition, we enhance the encoder with a GeLU-based nonlinear transformation to increase its expressive capacity under this separation objective, enabling a more discriminative and disentangled latent space. Experiments on UnlearnCanvas demonstrate state-of-the-art performance, with particularly strong gains in joint style-object unlearning, a challenging setting where existing methods suffer from severe interference between target and non-target concepts.
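The abstract does not give the exact form of the concept-aware contrastive objective. As a rough illustration, the sketch below uses a standard supervised contrastive loss as a stand-in: latents that share a concept label are pulled together and all others are pushed apart. The function name, the temperature value, and the toy data are assumptions, and in training such a term would presumably be added to the usual sparse reconstruction loss with some weighting.

```python
import numpy as np

def concept_contrastive_loss(z, labels, temperature=0.1):
    """Supervised contrastive loss over SAE latents (a stand-in objective).

    z: (n, d) latent vectors; labels: (n,) concept ids.
    """
    z = np.asarray(z, dtype=float)
    labels = np.asarray(labels)
    n = len(labels)

    # Cosine similarity between all pairs, scaled by the temperature.
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-8)
    sim = z @ z.T / temperature

    # Exclude each sample's similarity with itself from the softmax.
    not_self = ~np.eye(n, dtype=bool)
    sim = np.where(not_self, sim, -np.inf)

    # Numerically stable log-softmax over the other samples.
    sim_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - sim_max - np.log(np.exp(sim - sim_max).sum(axis=1, keepdims=True))

    # Positives are other samples with the same concept label.
    pos = (labels[:, None] == labels[None, :]) & not_self
    n_pos = pos.sum(axis=1)
    valid = n_pos > 0
    if not valid.any():
        return 0.0

    pos_log_prob = np.where(pos, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[valid] / n_pos[valid]).mean())

# Toy latents forming two tight clusters.
latents = np.array([[1.0, 0.0], [1.0, 0.05], [0.0, 1.0], [0.05, 1.0]])
loss_aligned = concept_contrastive_loss(latents, [0, 0, 1, 1])  # labels match clusters
loss_mixed = concept_contrastive_loss(latents, [0, 1, 0, 1])    # labels cut across clusters
```

In this toy setup the loss is lower when concept labels coincide with the clusters, which is the behavior a separation objective of this kind rewards.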

Problem

Research questions and friction points this paper is trying to address.

concept unlearning

diffusion models

disentangled representations

sparse autoencoders

concept interference

Innovation

Methods, ideas, or system contributions that make the work stand out.

disentangled representation

sparse autoencoder

concept-aware contrastive learning