🤖 AI Summary
To address the vulnerability of latent diffusion models (LDMs) to adversarial attacks, in which maliciously crafted inputs bypass safety checkers and trigger generation of NSFW content (e.g., pornographic or violent images), this paper proposes CROPS (Circular or RandOm Prompts for Safety), a training-free, model-agnostic defense framework. CROPS makes safety checking more robust by applying circular or random perturbations to text prompts or input latents, exploiting the observation that these attacks do not survive small changes to either. The paper also introduces CROPS-1, a lightweight variant that uses one-step diffusion models for efficient NSFW detection. Key contributions include: (1) the first training-free and model-agnostic defense paradigm against adversarial attacks on LDMs; (2) defense success rates exceeding 98% across diverse attack types; and (3) seamless compatibility with mainstream LDMs such as Stable Diffusion, achieving a 76% reduction in detection latency without fine-tuning or additional training data.
📝 Abstract
With advances in diffusion models, the quality of image generation has improved significantly. This raises concerns about the potential abuse of image generation, such as the creation of explicit or violent images, commonly referred to as Not Safe For Work (NSFW) content. To address this, the Stable Diffusion model includes several safety checkers that censor the initial text prompts and the final output images generated by the model. However, recent research has shown that these safety checkers are vulnerable to adversarial attacks, which bypass them so that the model still generates NSFW images. In this paper, we find that these adversarial attacks are not robust to small changes in text prompts or input latents. Based on this, we propose CROPS (Circular or RandOm Prompts for Safety), a model-agnostic framework that easily defends against adversarial attacks generating NSFW images without requiring additional training. Moreover, we develop an approach that utilizes one-step diffusion models for efficient NSFW detection (CROPS-1), further reducing computational cost. We demonstrate the superiority of our method in terms of performance and applicability.
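The core observation, that an adversarial prompt stops evading the checker once it is slightly changed, can be illustrated with a minimal sketch. This is not the authors' implementation: the reading of "circular" as token rotation, the token-dropping random perturbation, the rule "flag if any variant is flagged", and the names `circular_variants`, `random_variants`, `is_flagged`, and `toy_checker` are all assumptions made for illustration. The sketch covers only text prompts; perturbing input latents and checking generated images are omitted.

```python
import random
from typing import Callable, List


def circular_variants(prompt: str) -> List[str]:
    """Every rotation of the whitespace-separated tokens.

    Assumption: 'circular' is read here as a circular shift of tokens.
    """
    tokens = prompt.split()
    return [" ".join(tokens[i:] + tokens[:i]) for i in range(len(tokens))]


def random_variants(prompt: str, n: int, seed: int = 0) -> List[str]:
    """n copies of the prompt, each with one randomly chosen token dropped.

    Assumption: token dropping stands in for a 'random' perturbation.
    """
    rng = random.Random(seed)
    tokens = prompt.split()
    if len(tokens) < 2:
        return [prompt] * n
    variants = []
    for _ in range(n):
        drop = rng.randrange(len(tokens))
        variants.append(" ".join(tokens[:drop] + tokens[drop + 1:]))
    return variants


def is_flagged(prompt: str, checker: Callable[[str], bool], n_random: int = 4) -> bool:
    """Flag the prompt if the checker fires on it or on any perturbed variant."""
    candidates = [prompt] + circular_variants(prompt) + random_variants(prompt, n_random)
    return any(checker(p) for p in candidates)


def toy_checker(prompt: str) -> bool:
    """Hypothetical, deliberately brittle checker: fires only if the prompt
    starts with the word 'unsafe'. Real safety checkers are learned models."""
    return prompt.startswith("unsafe")


if __name__ == "__main__":
    attack = "xq unsafe scene"   # prefix token hides the trigger from toy_checker
    print(toy_checker(attack))               # checker alone on the raw prompt
    print(is_flagged(attack, toy_checker))   # checker over perturbed variants
    print(is_flagged("a calm lake", toy_checker))
```

In this toy setting the unmodified attack prompt slips past the brittle checker, while one of its rotations does not. Aggregating over variants with `any` favors catching attacks over avoiding false positives; a majority vote would be an equally plausible choice under these assumptions.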