GhostPrompt: Jailbreaking Text-to-image Generative Models based on Dynamic Optimization

📅 2025-05-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing jailbreaking methods fail against the robust multimodal safety filters (text + image) now deployed on text-to-image (T2I) models. Method: We propose the first automated jailbreaking framework that jointly optimizes textual prompts and injects benign visual cues, enabling semantic- and visual-level bypass through dynamic prompt refinement and multimodal feedback integration. The two-stage architecture combines dynamic optimization with adaptive safety-indicator injection, integrating CLIP-guided similarity optimization with reinforcement learning–driven visual cue injection. The framework unifies LLM-based iterative prompt refinement, adversarial prompt optimization against text filters, CLIP-based feedback, and multimodal red-teaming evaluation. Contribution/Results: On ShieldLM-7B, the jailbreak success rate increases from 12.5% to 99.0%, CLIP similarity reaches 0.2762, and runtime decreases by 4.2×. The method successfully jailbreaks DALL·E 3 and generalizes to unseen filters, including GPT-4.1, demonstrating cross-model and cross-filter transferability.
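To make the dynamic-optimization stage concrete, here is a minimal sketch of the feedback loop the summary describes: an LLM rewrites the prompt, the text filter's verdict is fed back into the next rewrite, and a CLIP similarity check keeps accepted rewrites semantically aligned with the original target. The helper callables (llm_rewrite, text_filter, gen_and_score), the iteration budget, and the similarity threshold are illustrative assumptions, not the paper's actual API or hyperparameters.

```python
from typing import Callable, Optional, Tuple

def optimize_prompt(
    target: str,
    llm_rewrite: Callable[[str, str, str], str],     # (target, last candidate, feedback) -> new prompt
    text_filter: Callable[[str], Tuple[bool, str]],  # prompt -> (passes?, filter feedback)
    gen_and_score: Callable[[str, str], float],      # (prompt, target) -> CLIP similarity of image vs. target
    max_iters: int = 20,
    sim_threshold: float = 0.26,
) -> Optional[str]:
    """Iteratively rewrite a blocked prompt until it passes the text
    safety filter while staying semantically close to the target."""
    candidate, feedback = target, ""
    for _ in range(max_iters):
        # Ask the LLM for a new rewrite, conditioned on the filter's
        # verdict from the previous round.
        candidate = llm_rewrite(target, candidate, feedback)
        passed, feedback = text_filter(candidate)
        if not passed:
            continue  # loop again with the updated feedback
        # Semantic check: generate an image and score it against the
        # ORIGINAL target prompt with CLIP; accept only aligned rewrites.
        if gen_and_score(candidate, target) >= sim_threshold:
            return candidate
    return None  # attack budget exhausted
```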

📝 Abstract
Text-to-image (T2I) generation models can inadvertently produce not-safe-for-work (NSFW) content, prompting the integration of text and image safety filters. Recent advances employ large language models (LLMs) for semantic-level detection, rendering traditional token-level perturbation attacks largely ineffective. However, our evaluation shows that existing jailbreak methods are ineffective against these modern filters. We introduce GhostPrompt, the first automated jailbreak framework that combines dynamic prompt optimization with multimodal feedback. It consists of two key components: (i) Dynamic Optimization, an iterative process that guides an LLM using feedback from text safety filters and CLIP similarity scores to generate semantically aligned adversarial prompts; and (ii) Adaptive Safety Indicator Injection, which formulates the injection of benign visual cues as a reinforcement learning problem to bypass image-level filters. GhostPrompt achieves state-of-the-art performance, increasing the ShieldLM-7B bypass rate from 12.5% (SneakyPrompt) to 99.0%, improving the CLIP score from 0.2637 to 0.2762, and reducing the time cost by 4.2×. Moreover, it generalizes to unseen filters, including GPT-4.1, and successfully jailbreaks DALL·E 3 to generate NSFW images in our evaluation, revealing systemic vulnerabilities in current multimodal defenses. To support further research on AI safety and red-teaming, we will release code and adversarial prompts under a controlled-access protocol.
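Component (ii) casts the choice of which benign visual cue to inject as a reinforcement-learning problem. One simple reading is a multi-armed bandit over candidate cues, rewarded when the generated image slips past the image-level filter. The epsilon-greedy sketch below is a minimal illustration under that assumption; the candidate-cue list, reward function, and the bandit formulation itself are not taken from the paper.

```python
import random
from typing import Callable, List

def select_safety_indicator(
    cues: List[str],                        # candidate benign visual cues to inject
    bypass_reward: Callable[[str], float],  # e.g. 1.0 if the image filter is bypassed, else 0.0
    episodes: int = 50,
    epsilon: float = 0.1,
) -> str:
    """Epsilon-greedy bandit: learn which benign cue most reliably
    slips the generated image past the image-level safety filter."""
    values = [0.0] * len(cues)  # running mean reward per cue
    counts = [0] * len(cues)
    for _ in range(episodes):
        if random.random() < epsilon:
            i = random.randrange(len(cues))                     # explore
        else:
            i = max(range(len(cues)), key=lambda j: values[j])  # exploit
        r = bypass_reward(cues[i])                 # inject cue, query the filter
        counts[i] += 1
        values[i] += (r - values[i]) / counts[i]   # incremental mean update
    return cues[max(range(len(cues)), key=lambda j: values[j])]
```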
Problem

Research questions and friction points this paper is trying to address.

Bypassing modern text and image safety filters in T2I models
Automating adversarial prompt generation using dynamic optimization
Exposing vulnerabilities in multimodal defenses such as GPT-4.1-based filters and DALL·E 3
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic prompt optimization with multimodal feedback (see the CLIP scoring sketch after this list)
Iterative LLM guidance using safety filter feedback
Reinforcement learning for benign visual cue injection
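The CLIP-based feedback referenced above can be computed with an off-the-shelf CLIP checkpoint. Below is a minimal sketch using the Hugging Face transformers API; the openai/clip-vit-base-patch32 checkpoint is an assumed choice, not necessarily the one used in the paper.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; any CLIP variant with paired text/image towers works.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, text: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())
```

On this normalized-embedding scale, cosine similarities in the 0.26–0.28 range match the CLIP scores reported in the results above.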
Zixuan Chen
Shanghai Jiao Tong University, Shanghai, China
Hao Lin
Sun Yat-sen University, Guangzhou, China
Ke Xu
Shanghai Jiao Tong University, Shanghai, China
Xinghao Jiang
Professor, Shanghai Jiao Tong University
Tanfeng Sun
Shanghai Jiao Tong University, Shanghai, China