Robust Concept Erasure Using Task Vectors

📅 2024-04-04
🏛️ arXiv.org
📈 Citations: 9
Influential: 1
📄 PDF
🤖 AI Summary
This work addresses the critical safety challenge of **unconditional and robust erasure of harmful concepts** in text-to-image models—without requiring user-provided prompts or interventions. Methodologically, it introduces (1) **Diverse Inversion**, a latent-space inversion strategy that enhances diversity and robustness in concept representation estimation; (2) a **concept-driven embedding set construction** coupled with **sparse weight subset editing**, enabling precise localization and attenuation of parameters associated with the target concept; and (3) **Task Vectors** to dynamically estimate optimal editing strength. Experiments demonstrate that the method significantly improves generalization and robustness of concept erasure under unseen prompts, drastically reduces unintended concept deletion, and preserves over 92% of the original model’s generation quality and diversity. The approach provides a scalable, model-level defense mechanism for safe and controllable deployment of generative AI systems.

Technology Category

Application Category

📝 Abstract
With the rapid growth of text-to-image models, a variety of techniques have been suggested to prevent undesirable image generations. Yet, these methods often only protect against specific user prompts and have been shown to allow unsafe generations with other inputs. Here we focus on unconditionally erasing a concept from a text-to-image model rather than conditioning the erasure on the user's prompt. We first show that compared to input-dependent erasure methods, concept erasure that uses Task Vectors (TV) is more robust to unexpected user inputs, not seen during training. However, TV-based erasure can also affect the core performance of the edited model, particularly when the required edit strength is unknown. To this end, we propose a method called Diverse Inversion, which we use to estimate the required strength of the TV edit. Diverse Inversion finds within the model input space a large set of word embeddings, each of which induces the generation of the target concept. We find that encouraging diversity in the set makes our estimation more robust to unexpected prompts. Finally, we show that Diverse Inversion enables us to apply a TV edit only to a subset of the model weights, enhancing the erasure capabilities while better maintaining the core functionality of the model.
Problem

Research questions and friction points this paper is trying to address.

Unconditional concept erasure in text-to-image models
Robustness to unexpected user inputs
Maintaining core model performance during erasure
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Task Vectors for robust concept erasure
Proposes Diverse Inversion for edit strength estimation
Applies TV edit to subset of model weights
🔎 Similar Papers
No similar papers found.