Low-Effort Jailbreak Attacks Against Text-to-Image Safety Filters

📅 2026-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current text-to-image generation models struggle to detect harmful intent concealed within seemingly benign semantic content, leaving their safety filters vulnerable to bypass via simple natural language prompts. This work proposes a low-effort jailbreak attack that requires no model access, optimization, or adversarial training; it leverages a set of semantic masking strategies (artistic reframing, material substitution, pseudo-educational framing, lifestyle aesthetic camouflage, and ambiguous action substitution) to mount effective black-box attacks. Through natural language prompt engineering and semantic context obfuscation, the approach attains an attack success rate of up to 74.47% across multiple state-of-the-art text-to-image models, exposing a critical gap in the deep semantic understanding of existing safety mechanisms. The study also contributes the first systematic taxonomy of visual jailbreak attacks.
📝 Abstract
Text-to-image generative models are widely deployed in creative tools and online platforms. To mitigate misuse, these systems rely on safety filters and moderation pipelines that aim to block harmful or policy-violating content. In this work we show that modern text-to-image models remain vulnerable to low-effort jailbreak attacks that require only natural language prompts. We present a systematic study of prompt-based strategies that bypass safety filters without model access, optimization, or adversarial training. We introduce a taxonomy of visual jailbreak techniques including artistic reframing, material substitution, pseudo-educational framing, lifestyle aesthetic camouflage, and ambiguous action substitution. These strategies exploit weaknesses in prompt moderation and visual safety filtering by masking unsafe intent within benign semantic contexts. We evaluate these attacks across several state-of-the-art text-to-image systems and demonstrate that simple linguistic modifications can reliably evade existing safeguards and produce restricted imagery. Our findings highlight a critical gap between surface-level prompt filtering and the semantic understanding required to detect adversarial intent in generative media systems. Across all tested models and attack categories we observe an attack success rate (ASR) of up to 74.47%.
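The headline metric above is the attack success rate (ASR): the fraction of masked prompts that slip past a model's safeguards and yield restricted imagery. Below is a minimal sketch of that per-model, per-category bookkeeping; the tuple record format, the model name, and the bypass flag are illustrative assumptions, not the paper's published evaluation harness.

```python
# Hypothetical ASR tally, assuming each evaluation record is a
# (model, attack_category, bypassed_filter) tuple. Not the authors' code.
from collections import defaultdict

def attack_success_rate(results):
    """Return ASR per (model, category): successful bypasses / total prompts."""
    totals = defaultdict(int)     # prompts attempted per (model, category)
    successes = defaultdict(int)  # prompts that evaded the safety filter
    for model, category, bypassed in results:
        totals[(model, category)] += 1
        successes[(model, category)] += int(bypassed)
    return {key: successes[key] / totals[key] for key in totals}

# Toy example: 35 bypasses out of 47 prompts reproduces the reported
# upper bound of roughly 74.47% ASR.
demo = [("model-A", "artistic reframing", i < 35) for i in range(47)]
print(attack_success_rate(demo))  # {('model-A', 'artistic reframing'): 0.7446...}
```

Tracking ASR per model and per attack category, rather than a single pooled rate, is what lets the abstract report an upper bound of "up to 74.47%".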
Problem

Research questions and friction points this paper is trying to address.

jailbreak attacks
text-to-image models
safety filters
prompt moderation
adversarial intent
Innovation

Methods, ideas, or system contributions that make the work stand out.

jailbreak attacks
text-to-image generation
safety filters
prompt-based evasion
adversarial intent
Ahmed B Mustafa
School of Computer Science, University of Nottingham, Wollaton Road, NG8 1BB
Zihan Ye
University of Chinese Academy of Sciences (UCAS)
Deep Learning · Zero-shot Learning · Computer Vision · Generative Model
Yang Lu
School of Informatics, Xiamen University, Xiamen, 361005, China
Michael P Pound
School of Computer Science, University of Nottingham, Wollaton Road, NG8 1BB
Shreyank N Gowda
Assistant Professor at the University of Nottingham
Computer Vision · Zero-shot Learning · Green AI