Comprehensive Evaluation and Analysis for NSFW Concept Erasure in Text-to-Image Diffusion Models

📅 2025-05-21
🤖 AI Summary
Text-to-image diffusion models pose significant safety risks due to their propensity to generate NSFW (Not Safe For Work) content, yet existing concept erasure methods lack systematic, standardized evaluation. To address this gap, we introduce the first end-to-end NSFW concept erasure evaluation toolkit, establishing a standardized, multi-granularity safety assessment framework that integrates controllable generation benchmarks, distributional shift detection, and a human annotation protocol. We conduct the first comprehensive empirical study across major erasure paradigms—including fine-tuning, gradient masking, concept inversion, and prompt-based adversarial methods—evaluating their performance on cross-category generalization, semantic coherence, and functional robustness. Our bidirectional analysis paradigm links mechanistic insights to practical application outcomes. Experimental results demonstrate that our framework substantially improves the reliability of safety strategy deployment, providing a reproducible benchmark and actionable guidelines for content safety governance.

📝 Abstract
Text-to-image diffusion models have gained widespread application across various domains, demonstrating remarkable creative potential. However, the strong generalization capabilities of diffusion models can inadvertently lead to the generation of not-safe-for-work (NSFW) content, posing significant risks to their safe deployment. While several concept erasure methods have been proposed to mitigate the risks associated with NSFW content, a comprehensive evaluation of their effectiveness across diverse scenarios remains absent. To bridge this gap, we introduce a full-pipeline toolkit specifically designed for concept erasure and conduct the first systematic study of NSFW concept erasure methods. By examining the interplay between the underlying mechanisms and empirical observations, we provide in-depth insights and practical guidance for the effective application of concept erasure methods in real-world scenarios. These findings advance the understanding of content safety in diffusion models and establish a solid foundation for future research and development in this critical area.
Problem

Research questions and friction points this paper is trying to address.

Evaluating NSFW content erasure in diffusion models
Assessing the effectiveness of erasure methods across diverse scenarios
Providing guidance for safe real-world model deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Full-pipeline toolkit for NSFW concept erasure
Systematic study of NSFW erasure methods
Mechanism-empirical interplay analysis for guidance
Die Chen
School of Data Science & Engineering, East China Normal University

Zhiwen Li
NIAID
Bioinformatics

Cen Chen
School of Data Science & Engineering, East China Normal University

Yuexiang Xie
Alibaba Group
NLP, AutoML, Federated Learning

Xiaodan Li
School of Data Science & Engineering, East China Normal University

Jinyan Ye
School of Data Science & Engineering, East China Normal University

Yingda Chen
Alibaba Group, Microsoft

Yaliang Li
Alibaba Group
Machine Learning