🤖 AI Summary
Current safety mechanisms for text-to-image diffusion models predominantly rely on binary interception, which is vulnerable to adversarial prompts and suffers from high false-positive rates. This work proposes Disciplined Diffusion, a novel approach that abandons coarse-grained blocking in favor of semantic retrieval to detect implicit malicious concepts within input prompts. During the diffusion process, the method precisely localizes and selectively edits harmful image regions. By integrating malicious semantic evaluation in the embedding space with fine-grained image sanitization, Disciplined Diffusion effectively suppresses NSFW content while preserving the quality of benign generations. This dual strategy significantly enhances model robustness against adversarial attacks and markedly reduces false positives compared to existing interception-based safeguards.
📝 Abstract
Text-to-image (T2I) diffusion models can generate high-quality images from text prompts, but they raise safety concerns because harmful inputs can lead them to produce offensive or disturbing imagery. Existing safety filters typically rely on text-based classifiers or image-based checkers that block the output entirely upon detecting a threat, giving the user an explicit allow/block signal. This binary strategy leaves models vulnerable to adversarial attacks that alter keywords to bypass detection, and it produces high false-alarm rates that degrade the experience for benign users. To address these vulnerabilities, we propose Disciplined Diffusion (DDiffusion), a novel, robust text-to-image diffusion approach that counters Not Safe For Work (NSFW) generation by uncovering implicit malicious semantics in prompt embeddings. DDiffusion uses a semantic retrieval mechanism to evaluate prompts against concept distributions rather than relying on brittle pairwise similarity. It also applies a localization method during the diffusion process to edit only the harmful regions of the generated image. By returning locally sanitized images instead of applying uniform blocking, DDiffusion suppresses malicious content while preserving generation fidelity for benign prompts, and it withholds the binary allow/deny signal on which existing probing attacks rely.
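To make the two ideas in the abstract concrete, the following is a minimal, illustrative sketch and not the authors' implementation. It assumes that "evaluating against a concept distribution" can be approximated by averaging cosine similarity over the k nearest entries of a bank of concept embeddings, and that "selective editing" can be approximated by replacing pixels under a binary mask. The names `retrieval_score`, `sanitize`, `concept_bank`, and the parameter `k` are hypothetical and do not come from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieval_score(prompt_emb, concept_bank, k=3):
    """Assumed scoring rule: mean similarity to the k nearest concept embeddings.

    Averaging over several retrieved neighbours is one simple way to score a
    prompt against a set of concept samples instead of a single reference
    vector. The paper's actual retrieval mechanism may differ.
    """
    if not concept_bank:
        raise ValueError("concept_bank must contain at least one embedding")
    sims = sorted((cosine(prompt_emb, c) for c in concept_bank), reverse=True)
    top = sims[:k]
    return sum(top) / len(top)


def sanitize(image, replacement, mask):
    """Assumed editing rule: take replacement pixels only where mask is truthy.

    All three arguments are 2D lists of the same shape. Pixels outside the
    mask are returned unchanged, so benign regions are preserved.
    """
    return [
        [r if m else p for p, r, m in zip(prow, rrow, mrow)]
        for prow, rrow, mrow in zip(image, replacement, mask)
    ]
```

In this sketch, a caller would compute `retrieval_score` on the prompt embedding, compare it with a threshold chosen on held-out data, and call `sanitize` only on the localized region, so the user always receives an image rather than an allow/block response.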