Seeing It Before It Happens: In-Generation NSFW Detection for Diffusion-Based Text-to-Image Models

πŸ“… 2025-08-04
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address the challenge of real-time NSFW content detection during text-to-image generation in diffusion models, this paper proposes the first in-generation NSFW detection method based on predicted-noise signals. Leveraging the semantic information embedded in noise predictions, the authors design a dynamic recognition mechanism targeting seven NSFW categories, enabling sensitive, low-latency detection for both benign and adversarial prompts. Crucially, the approach requires no architectural modifications to the diffusion model and no external moderation modules: risk intervention is achieved solely by monitoring internal signals. Evaluated on a benchmark covering seven NSFW categories, the method achieves a mean accuracy of 91.32%, outperforming seven baseline methods. This work establishes a novel paradigm for safe, controllable image generation with diffusion models.

πŸ“ Abstract
Diffusion-based text-to-image (T2I) models enable high-quality image generation but also pose significant risks of misuse, particularly in producing not-safe-for-work (NSFW) content. While prior detection methods have focused on filtering prompts before generation or moderating images afterward, the in-generation phase of diffusion models remains largely unexplored for NSFW detection. In this paper, we introduce In-Generation Detection (IGD), a simple yet effective approach that leverages the predicted noise during the diffusion process as an internal signal to identify NSFW content. This approach is motivated by preliminary findings suggesting that the predicted noise may capture semantic cues that differentiate NSFW from benign prompts, even when the prompts are adversarially crafted. Experiments conducted on seven NSFW categories show that IGD achieves an average detection accuracy of 91.32% over naive and adversarial NSFW prompts, outperforming seven baseline methods.
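The abstract's core idea — treating the predicted noise at each denoising step as a feature for a lightweight NSFW classifier, so generation can be aborted mid-process — can be sketched roughly as below. Note this is a toy illustration, not the paper's implementation: the paper does not specify its classifier, so `predict_noise`, the pooled features, and the linear probe here are hypothetical stand-ins (in a real pipeline, `predict_noise` would be the diffusion UNet's output at step t, and the probe would be trained on noise features from NSFW vs. benign prompts).

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_noise(latent, t):
    """Stand-in for the diffusion model's noise prediction eps_theta(x_t, t).
    In a real pipeline this is the UNet output at denoising step t."""
    return 0.1 * latent + rng.normal(scale=0.01, size=latent.shape)

def noise_features(eps):
    """Pool the predicted-noise map into a compact feature vector
    (per-channel mean and std)."""
    c = eps.shape[0]
    flat = eps.reshape(c, -1)
    return np.concatenate([flat.mean(axis=1), flat.std(axis=1)])

def nsfw_score(features, w, b):
    """Hypothetical linear probe on noise features; sigmoid gives a risk score."""
    return 1.0 / (1.0 + np.exp(-(features @ w + b)))

# Monitor a toy 4-channel, 8x8 latent over a few denoising steps.
latent = rng.normal(size=(4, 8, 8))
w = rng.normal(scale=0.1, size=8)  # untrained weights: 4 means + 4 stds
b = 0.0
threshold = 0.9

flagged = False
for t in range(10, 0, -1):
    eps = predict_noise(latent, t)
    score = nsfw_score(noise_features(eps), w, b)
    if score > threshold:
        flagged = True   # abort generation early: in-generation intervention
        break
    latent = latent - 0.1 * eps  # simplified denoising update

print("flagged:", flagged)
```

The key property this illustrates is that detection happens inside the sampling loop, so no extra forward passes or external moderation models are needed; the monitor only reads a signal the sampler already computes.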
Problem

Research questions and friction points this paper is trying to address.

- Detect NSFW content during diffusion-based image generation
- Leverage the predicted noise as an internal signal for detection
- Improve accuracy over pre- and post-generation moderation methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

- Leverages predicted noise for NSFW detection
- Detects NSFW content during the generation phase
- Achieves high accuracy on adversarial prompts