Comprehensive Assessment and Analysis for NSFW Content Erasure in Text-to-Image Diffusion Models

📅 2025-02-18
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Text-to-image diffusion models frequently generate NSFW content, posing significant risks to safe deployment; however, existing concept-erasure methods lack systematic evaluation. This work presents the first comprehensive empirical study of the erasure of NSFW content and its fine-grained subcategories, evaluating 11 mainstream techniques and 14 variants across six dimensions: erasure proportion, image fidelity, semantic alignment, over-erasure, the impact of explicit and implicit unsafe prompts, and robustness. We propose a novel multidimensional evaluation framework incorporating implicit prompt sensitivity analysis, classifier-based toxicity assessment, and a fine-grained NSFW subcategory benchmark; we also release the first open-source, reproducible evaluation toolkit. Experiments reveal that current methods consistently fail under implicit unsafe prompts and frequently induce over-erasure. Our study substantially advances the credibility and practical safety of NSFW erasure techniques for real-world deployment.

๐Ÿ“ Abstract
Text-to-image (T2I) diffusion models have gained widespread application across various domains, demonstrating remarkable creative potential. However, the strong generalization capabilities of these models can inadvertently lead them to generate NSFW content even when efforts have been made to filter NSFW content from the training dataset, posing risks to their safe deployment. While several concept erasure methods have been proposed to mitigate this issue, a comprehensive evaluation of their effectiveness remains absent. To bridge this gap, we present the first systematic investigation of concept erasure methods for NSFW content and its sub-themes in text-to-image diffusion models. At the task level, we provide a holistic evaluation of 11 state-of-the-art baseline methods with 14 variants. Specifically, we analyze these methods from six distinct assessment perspectives, including three conventional perspectives, i.e., erasure proportion, image quality, and semantic alignment, and three new perspectives, i.e., excessive erasure, the impact of explicit and implicit unsafe prompts, and robustness. At the tool level, we perform a detailed toxicity analysis of NSFW datasets and compare the performance of different NSFW classifiers, offering deeper insights into their performance alongside a compilation of comprehensive evaluation metrics. Our benchmark not only systematically evaluates concept erasure methods, but also delves into the underlying factors influencing their performance at the insight level. By synthesizing insights from various evaluation perspectives, we provide a deeper understanding of the challenges and opportunities in the field, offering actionable guidance and inspiration for advancing research and practical applications in concept erasure.
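The "erasure proportion" perspective above can be illustrated with a minimal sketch: score each image generated for an unsafe prompt with an NSFW classifier before and after erasure, then report the fraction of originally flagged generations that the erased model no longer produces. The function and the boolean flag lists here are illustrative stand-ins, not the paper's actual toolkit or metric definition.

```python
def erasure_proportion(nsfw_flags_before, nsfw_flags_after):
    """Fraction of originally-NSFW generations made safe by erasure.

    Both arguments are parallel lists of booleans: whether an NSFW
    classifier flagged the image generated for prompt i.
    """
    unsafe_before = [i for i, flag in enumerate(nsfw_flags_before) if flag]
    if not unsafe_before:
        return 1.0  # nothing was unsafe, so nothing needed erasing
    erased = sum(1 for i in unsafe_before if not nsfw_flags_after[i])
    return erased / len(unsafe_before)


# Toy example: 4 of 5 prompts yielded NSFW images before erasure;
# 3 of those become safe afterwards.
before = [True, True, False, True, True]
after = [False, True, False, False, False]
print(erasure_proportion(before, after))  # 0.75
```

A real evaluation would pair this with the paper's other perspectives (e.g., checking that benign prompts are not over-erased), since a high erasure proportion alone can be achieved by a model that refuses everything.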
Problem

Research questions and friction points this paper is trying to address.

Evaluates NSFW content erasure in text-to-image models
Assesses 11 methods across six evaluation perspectives
Analyzes toxicity and classifier performance in NSFW datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematically evaluates concept erasure methods
Analyzes NSFW datasets for toxicity
Compares NSFW classifiers' performance
Authors
Die Chen — East China Normal University
Zhiwen Li — NIAID, Bioinformatics
Cen Chen — East China Normal University
Xiaodan Li — East China Normal University
Jinyan Ye — East China Normal University