ActErase: A Training-Free Paradigm for Precise Concept Erasure via Activation Patching

📅 2026-01-01

🏛️ arXiv.org

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

This work addresses the safety, copyright, and ethical risks posed by sensitive concepts (such as nudity, specific artistic styles, or objects) in text-to-image diffusion models. To mitigate these concerns, the authors propose a plug-and-play, fine-tuning-free concept erasure method that dynamically identifies and replaces critical activation regions during forward propagation. By analyzing the activation differences induced by paired prompts, the approach suppresses target concepts without requiring additional training data or model retraining. Evaluated on three erasure tasks (nudity, artistic style, and object removal), the method achieves state-of-the-art erasure performance, remains robust against adversarial attacks, and largely preserves the model's original generative capabilities.

📝 Abstract

Recent advances in text-to-image diffusion models have demonstrated remarkable generation capabilities, yet they raise significant concerns regarding safety, copyright, and ethical implications. Existing concept erasure methods address these risks by removing sensitive concepts from pre-trained models, but most of them rely on data-intensive and computationally expensive fine-tuning, which is a critical limitation. To overcome this limitation, and inspired by the observation that the model's activations are predominantly composed of generic concepts, with only a minimal component representing the target concept, we propose a novel training-free method (ActErase) for efficient concept erasure. Specifically, the proposed method operates by identifying activation difference regions via prompt-pair analysis, extracting target activations, and dynamically replacing input activations during forward passes. Comprehensive evaluations across three critical erasure tasks (nudity, artistic style, and object removal) demonstrate that our training-free method achieves state-of-the-art (SOTA) erasure performance while effectively preserving the model's overall generative capability. Our approach also exhibits strong robustness against adversarial attacks, establishing a new plug-and-play paradigm for lightweight yet effective concept manipulation in diffusion models.
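
The three steps named in the abstract (find the activation difference region from a prompt pair, extract activations, replace them during the forward pass) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how the region is selected or what the replacement values are, so the top-k selection rule, the use of the paired prompt's activations as the replacement, and the names `find_difference_region` and `patch_activations` are all assumptions made for illustration. Activations are plain lists of floats here rather than real diffusion-model tensors.

```python
from typing import List, Sequence


def find_difference_region(
    target_acts: Sequence[float],
    neutral_acts: Sequence[float],
    top_k: int,
) -> List[int]:
    """Return the indices of the top_k units whose activations differ most
    between the two prompts of a pair (assumed selection rule)."""
    if len(target_acts) != len(neutral_acts):
        raise ValueError("activation vectors must have the same length")
    diffs = [abs(t - n) for t, n in zip(target_acts, neutral_acts)]
    ranked = sorted(range(len(diffs)), key=lambda i: diffs[i], reverse=True)
    return sorted(ranked[:top_k])


def patch_activations(
    input_acts: Sequence[float],
    replacement_acts: Sequence[float],
    region: Sequence[int],
) -> List[float]:
    """Copy the input activations and overwrite only the selected region
    with the replacement values (assumed patching rule)."""
    patched = list(input_acts)
    for i in region:
        patched[i] = replacement_acts[i]
    return patched


# Toy activations for one layer, recorded for a hypothetical prompt pair.
with_concept = [0.9, 0.1, 2.5, 0.4, -1.8]    # prompt containing the target concept
without_concept = [1.0, 0.1, 0.2, 0.5, 0.1]  # paired prompt without it

# Step 1: locate the units that respond to the target concept.
region = find_difference_region(with_concept, without_concept, top_k=2)

# Steps 2-3: at inference time, patch only that region of the incoming activations.
incoming = [0.8, 0.2, 2.4, 0.3, -1.7]
patched = patch_activations(incoming, without_concept, region)

print(region)   # [2, 4]
print(patched)  # [0.8, 0.2, 0.2, 0.3, 0.1]
```

Only the two units that differ most between the paired prompts are overwritten; the remaining activations pass through unchanged, which mirrors the abstract's premise that most of the activation encodes generic content. In a real diffusion model the same idea would presumably be applied to layer outputs during the forward pass, but the layers and selection criterion used by ActErase are not stated in this summary.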

Problem

Research questions and friction points this paper is trying to address.

concept erasure

diffusion models

training-free

text-to-image generation

safety

Innovation

Methods, ideas, or system contributions that make the work stand out.

training-free

concept erasure

activation patching