🤖 AI Summary
This work investigates the jailbreak vulnerability of safety-aligned text-to-image models under adversarial prompting. We propose PromptTune, a query-free attack that fine-tunes a large language model (AttackLLM) to generate transferable and generalizable adversarial prompts, enabling the first known *query-free jailbreak* of multi-layer safety filters without iterative queries to the target model. Our approach combines adversarial prompt engineering with a unified evaluation framework spanning multiple safety barriers, including NSFW detectors, content-alignment models, and other mainstream safety mechanisms. Evaluated on three unsafe-prompt benchmarks against five state-of-the-art safety systems, PromptTune achieves up to a 42% higher success rate than existing query-free baselines and also significantly improves the efficiency of subsequent query-based attacks. This work establishes a critical advance in query-free adversarial prompting, markedly improving generalizability, cross-model transferability, and practical applicability.
📝 Abstract
Text-to-image models may generate harmful content, such as pornographic images, particularly when unsafe prompts are submitted. To address this issue, safety filters are often added on top of text-to-image models, or the models themselves are aligned to reduce harmful outputs. However, these defenses remain vulnerable when an attacker strategically designs adversarial prompts to bypass the safety guardrails. In this work, we propose PromptTune, a method that uses a fine-tuned large language model to jailbreak text-to-image models protected by safety guardrails. Unlike other query-based jailbreak attacks that require repeated queries to the target model, our attack generates adversarial prompts efficiently once our AttackLLM has been fine-tuned. We evaluate our method on three datasets of unsafe prompts and against five safety guardrails. Our results demonstrate that our approach effectively bypasses safety guardrails, outperforms existing no-box attacks, and also facilitates other query-based attacks.