ActErase: A Training-Free Paradigm for Precise Concept Erasure via Activation Patching

📅 2026-01-01
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the safety, copyright, and ethical risks posed by sensitive concepts—such as nudity, specific artistic styles, or objects—in text-to-image diffusion models. To mitigate these concerns, the authors propose a plug-and-play, fine-tuning-free concept erasure method that dynamically identifies and replaces critical activation regions during forward propagation. By analyzing the activation differences induced by paired prompts, the approach precisely suppresses target concepts without requiring additional training data or model retraining. Evaluated across three types of sensitive content removal tasks, the method achieves state-of-the-art performance while maintaining strong adversarial robustness and preserving the model’s original generative capabilities to a significant extent.

📝 Abstract
Recent advances in text-to-image diffusion models have demonstrated remarkable generation capabilities, yet they raise significant concerns regarding safety, copyright, and ethical implications. Existing concept erasure methods address these risks by removing sensitive concepts from pre-trained models, but most of them rely on data-intensive and computationally expensive fine-tuning, which is a critical limitation. To overcome this limitation, and inspired by the observation that a model's activations are predominantly composed of generic concepts, with only a minimal component representing the target concept, we propose a novel training-free method (ActErase) for efficient concept erasure. Specifically, the proposed method operates by identifying activation difference regions via prompt-pair analysis, extracting target activations, and dynamically replacing input activations during forward passes. Comprehensive evaluations across three critical erasure tasks (nudity, artistic style, and object removal) demonstrate that our training-free method achieves state-of-the-art (SOTA) erasure performance while effectively preserving the model's overall generative capability. Our approach also exhibits strong robustness against adversarial attacks, establishing a new plug-and-play paradigm for lightweight yet effective concept manipulation in diffusion models.
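
The prompt-pair idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names `difference_mask` and `patch_activations`, the `top_fraction` parameter, the top-k thresholding rule, and the choice to overwrite masked units with the neutral prompt's values are all assumptions made for illustration. Plain NumPy arrays stand in for the activations of one layer of a diffusion model.

```python
import numpy as np


def difference_mask(act_target, act_neutral, top_fraction=0.05):
    """Boolean mask over the units whose activations differ most between the paired prompts."""
    diff = np.abs(act_target - act_neutral)
    k = max(1, int(round(top_fraction * diff.size)))  # number of units to flag
    threshold = np.partition(diff.ravel(), -k)[-k]    # k-th largest difference
    return diff >= threshold


def patch_activations(act_input, act_reference, mask):
    """Overwrite the masked units of an incoming activation with reference values."""
    return np.where(mask, act_reference, act_input)


# Toy layer output with 10 units; only unit 3 responds to the target concept.
# (Hypothetical data: a real model would supply these from paired prompts.)
act_neutral = np.zeros(10)
act_target = act_neutral.copy()
act_target[3] = 5.0

# Flag the top 10% most different units, then patch a new input activation.
mask = difference_mask(act_target, act_neutral, top_fraction=0.1)
patched = patch_activations(act_target + 0.1, act_neutral, mask)
```

In this toy case only the unit that separates the two prompts is overwritten and every other unit passes through unchanged, which mirrors the abstract's claim that the target concept occupies a small part of the activation. In a real pipeline the patching step would presumably run inside the model's forward pass, for example through a forward hook on the chosen layer.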
Problem

Research questions and friction points this paper is trying to address.

concept erasure
diffusion models
training-free
text-to-image generation
safety
Innovation

Methods, ideas, or system contributions that make the work stand out.

training-free
concept erasure
activation patching
diffusion models
prompt-pair analysis
Yi Sun
Lecturer, Graduate School of Information Technology, Kobe Institute of Computing
Educational engineering, Computer Education, Text mining
Xinhao Zhong
Harbin Institute of Technology, Shenzhen
Data-centric AI, Efficient AI
Hongyan Li
Harbin Institute of Technology, Shenzhen
Yimin Zhou
Tsinghua Shenzhen International Graduate School, Tsinghua University
Junhao Li
Assistant Project Scientist, Cognitive Science, University of California, San Diego
Non-coding RNAs, DNA methylation, Epigenetics, Bioinformatics
Bin Chen
Harbin Institute of Technology, Shenzhen, Peng Cheng Laboratory
Xuan Wang
Harbin Institute of Technology, Shenzhen, Peng Cheng Laboratory