PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models

📅 2025-01-07
🤖 AI Summary
Text-to-image (T2I) models are vulnerable to adversarial prompts that elicit NSFW content, necessitating efficient, lossless safety mechanisms. This paper proposes a soft-prompt-guided content moderation method that requires no model architecture modification and preserves inference efficiency. We pioneer the adaptation of large language model (LLM) system prompting to the T2I domain by end-to-end optimizing a universal safety soft prompt $P^*$ in the text embedding space—serving as an implicit system instruction for zero-shot, high-fidelity suppression of NSFW inputs. Our approach integrates gradient-driven safety objective alignment with cross-dataset robust fine-tuning. Experiments demonstrate that our method reduces unsafe generation rates to 5.84% across three major benchmarks, achieves 7.8× higher inference speed than state-of-the-art defenses, significantly outperforms eight leading mitigation approaches, and fully preserves the quality of benign image generations.

📝 Abstract
Text-to-image (T2I) models have been shown to be vulnerable to misuse, particularly in generating not-safe-for-work (NSFW) content, raising serious ethical concerns. In this work, we present PromptGuard, a novel content moderation technique that draws inspiration from the system prompt mechanism in large language models (LLMs) for safety alignment. Unlike LLMs, T2I models lack a direct interface for enforcing behavioral guidelines. Our key idea is to optimize a safety soft prompt that functions as an implicit system prompt within the T2I model's textual embedding space. This universal soft prompt (P*) directly moderates NSFW inputs, enabling safe yet realistic image generation without altering the inference efficiency or requiring proxy models. Extensive experiments across three datasets demonstrate that PromptGuard effectively mitigates NSFW content generation while preserving high-quality benign outputs. PromptGuard runs 7.8 times faster than prior content moderation methods and surpasses eight state-of-the-art defenses, with an optimal unsafe ratio as low as 5.84%.
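The core idea described above — optimizing a universal soft prompt in the text-encoder embedding space so that encoded prompts are steered away from unsafe content — can be illustrated with a toy numerical sketch. Everything below (the 16-dimensional embeddings, the mean-pooling "encoder", the single "unsafe direction", and the loss weights) is invented for illustration; it is not the paper's actual objective, encoder, or training procedure:

```python
import numpy as np

# Toy stand-ins: a random "unsafe direction" in a small embedding space.
rng = np.random.default_rng(0)
d = 16
unsafe_dir = rng.normal(size=d)
unsafe_dir /= np.linalg.norm(unsafe_dir)

def encode(prompt_emb, soft_prompt):
    """Stand-in "text encoder": average the soft prompt with mean-pooled tokens."""
    return (soft_prompt + prompt_emb.mean(axis=0)) / 2.0

def grad(soft_prompt, prompt_emb):
    """Gradient of (z . unsafe_dir)^2 + 0.01*||soft_prompt||^2 w.r.t. the soft prompt."""
    z = encode(prompt_emb, soft_prompt)
    return np.dot(z, unsafe_dir) * unsafe_dir + 0.02 * soft_prompt

prompts = [rng.normal(size=(5, d)) for _ in range(8)]  # 8 toy prompts, 5 tokens each
p_star = np.zeros(d)                                   # the universal soft prompt P*
for _ in range(500):                                   # plain SGD over the toy set
    for pe in prompts:
        p_star -= 0.1 * grad(p_star, pe)

def unsafe_score(soft_prompt):
    """Mean squared alignment of encoded prompts with the unsafe direction."""
    return np.mean([np.dot(encode(pe, soft_prompt), unsafe_dir) ** 2 for pe in prompts])

before, after = unsafe_score(np.zeros(d)), unsafe_score(p_star)
assert after < before  # the optimized soft prompt suppresses the unsafe component
```

The sketch captures only the structural point: a single learnable vector, prepended in embedding space rather than token space, is trained across many prompts (here, across a toy "dataset", loosely mirroring the cross-dataset fine-tuning mentioned in the summary) and then reused at inference with no extra cost beyond a few additional embedding slots.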
Problem

Research questions and friction points this paper is trying to address.

T2I Models
Content Filtering
Ethical Issues
Innovation

Methods, ideas, or system contributions that make the work stand out.

PromptGuard
T2I Optimization
Safe Content Generation
Lingzhi Yuan
PhD at University of Maryland, College Park & BEng at Zhejiang University
Trustworthy ML, AI Safety, Adversarial Robustness
Xinfeng Li
Nanyang Technological University
Chejian Xu
University of Illinois at Urbana-Champaign
Large Language Model, Trustworthy AI
Guanhong Tao
Assistant Professor, University of Utah
Machine Learning, Computer Security
Xiaojun Jia
Nanyang Technological University
Explainable AI, Robust AI, Efficient AI
Yihao Huang
Nanyang Technological University
Wei Dong
Nanyang Technological University
Yang Liu
Nanyang Technological University
Xiaofeng Wang
Indiana University Bloomington
Bo Li
University of Chicago, University of Illinois at Urbana–Champaign