🤖 AI Summary
This work addresses the challenge of erasing specific concepts from pretrained text-to-image diffusion models without retraining, while preventing circumvention via paraphrased or adversarial prompts. The authors propose PURE (Projection in U-Net Rendering for Erasure), a closed-form method that represents the target concept in the U-Net's cross-attention activation space rather than in the text embedding space. PURE builds forget and retain bases from per-layer cross-attention activations and applies a single linear projector to the cross-attention key and value weights, adding no inference-time cost. Evaluated on ten concepts across four categories (artistic style, intellectual property, celebrity, and NSFW), PURE significantly reduces target leakage while preserving retain concepts close to the unedited model, yielding the best overall forget-retain trade-off among the evaluated methods.
📝 Abstract
Concept unlearning aims to erase a target concept from a pretrained text-to-image diffusion model without retraining. Closed-form methods are attractive in this setting because they apply a single deterministic edit to the cross-attention weights and add no inference-time cost. Existing closed-form methods, however, represent the target concept through the text encoder's response to a few short anchor prompts that name it, and paraphrased prompts that evoke the concept without naming it consistently bypass the edit. We argue that the target should instead be represented in the cross-attention activation space. Text embeddings describe the user's prompt, while cross-attention activations describe what the model is about to render, and the latter generalize to paraphrases that the anchor templates do not cover. Building on this observation, we propose PURE (Projection in U-Net Rendering for Erasure), a closed-form method that builds the forget and retain bases from per-layer cross-attention activations captured along a short denoising trajectory and applies a single linear projector to the cross-attention key and value weights. On a recent holistic concept-unlearning benchmark covering ten concepts across artistic style, intellectual property, celebrity, and NSFW categories, PURE significantly reduces target leakage under paraphrased and adversarial prompts while preserving retain concepts close to the unedited model, yielding the best overall forget-retain trade-off among evaluated methods.
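To make the idea of a closed-form projection edit concrete, the following is a minimal NumPy sketch for a single cross-attention layer. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify how the bases are computed, so the sketch assumes SVD-based bases, removes the retain component from the forget activations before extracting the forget basis, and applies the projector on the output (activation) side of the weight. All function names, dimensions, and the random stand-in activations are hypothetical.

```python
import numpy as np


def top_right_singular_vectors(acts, k):
    """Return a (d, k) matrix with orthonormal columns spanning the
    top-k directions of the rows of `acts` (shape: n_samples x d)."""
    _, _, vt = np.linalg.svd(acts, full_matrices=False)
    return vt[:k].T


def build_projector(forget_acts, retain_acts, k_forget, k_retain):
    """Build P = I - U U^T, where U spans forget directions that are
    orthogonal to the retain basis (an assumed construction)."""
    d = forget_acts.shape[1]
    retain_basis = top_right_singular_vectors(retain_acts, k_retain)
    # Remove the retain component so the forget basis cannot overlap it.
    forget_residual = forget_acts - forget_acts @ retain_basis @ retain_basis.T
    forget_basis = top_right_singular_vectors(forget_residual, k_forget)
    projector = np.eye(d) - forget_basis @ forget_basis.T
    return projector, forget_basis, retain_basis


def edit_weight(weight, projector):
    """Apply the projector to a key or value weight of shape (d, d_text),
    so that edited activations have no component along the forget basis."""
    return projector @ weight


# Random stand-ins for captured activations and a key weight (hypothetical sizes).
rng = np.random.default_rng(0)
d, d_text = 16, 24
forget_acts = rng.normal(size=(32, d))
retain_acts = rng.normal(size=(64, d))
w_k = rng.normal(size=(d, d_text))

projector, forget_basis, retain_basis = build_projector(
    forget_acts, retain_acts, k_forget=3, k_retain=4
)
w_k_edited = edit_weight(w_k, projector)
```

Because the edit is a single matrix product applied once to the stored weights, the edited layer has the same shape as the original and adds no inference-time cost, which matches the property the abstract attributes to closed-form methods. In this sketch the retain directions pass through the projector unchanged, while any key or value produced by the edited weight has zero component along the forget basis.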