GhostPrompt: Jailbreaking Text-to-image Generative Models based on Dynamic Optimization

📅 2025-05-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing jailbreaking methods fail against the robust multimodal safety filters (text + image) now deployed on text-to-image (T2I) models. Method: We propose the first automated jailbreaking framework that jointly optimizes textual prompts and injects benign visual cues, enabling semantic- and visual-level bypass through dynamic prompt refinement and multimodal feedback integration. The two-stage architecture combines dynamic optimization with adaptive safety-indicator injection, integrating CLIP-guided similarity optimization with reinforcement learning–driven visual cue injection. The framework unifies LLM-based iterative prompt refinement, adversarial prompt optimization against text filters, CLIP-based feedback, and multimodal red-teaming evaluation. Contribution/Results: On ShieldLM-7B, the jailbreak success rate increases from 12.5% to 99.0%, CLIP similarity reaches 0.2762, and runtime decreases by 4.2×. The method successfully jailbreaks DALL·E 3 and generalizes to unseen filters, including GPT-4.1, demonstrating cross-model and cross-filter transferability.
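To make the dynamic-optimization stage concrete, here is a minimal sketch of the feedback loop the summary describes: an LLM rewrites the prompt, the text filter's verdict is fed back into the next rewrite, and a CLIP similarity check keeps accepted rewrites semantically aligned with the original target. The helper callables (llm_rewrite, text_filter, gen_and_score), the iteration budget, and the similarity threshold are illustrative assumptions, not the paper's actual API or hyperparameters.

```python
from typing import Callable, Optional, Tuple

def optimize_prompt(
    target: str,
    llm_rewrite: Callable[[str, str, str], str],     # (target, last candidate, feedback) -> new prompt
    text_filter: Callable[[str], Tuple[bool, str]],  # prompt -> (passes?, filter feedback)
    gen_and_score: Callable[[str, str], float],      # (prompt, target) -> CLIP similarity of image vs. target
    max_iters: int = 20,
    sim_threshold: float = 0.26,
) -> Optional[str]:
    """Iteratively rewrite a blocked prompt until it passes the text
    safety filter while staying semantically close to the target."""
    candidate, feedback = target, ""
    for _ in range(max_iters):
        # Ask the LLM for a new rewrite, conditioned on the filter's
        # verdict from the previous round.
        candidate = llm_rewrite(target, candidate, feedback)
        passed, feedback = text_filter(candidate)
        if not passed:
            continue  # loop again with the updated feedback
        # Semantic check: generate an image and score it against the
        # ORIGINAL target prompt with CLIP; accept only aligned rewrites.
        if gen_and_score(candidate, target) >= sim_threshold:
            return candidate
    return None  # attack budget exhausted
```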

📝 Abstract
Text-to-image (T2I) generation models can inadvertently produce not-safe-for-work (NSFW) content, prompting the integration of text and image safety filters. Recent advances employ large language models (LLMs) for semantic-level detection, rendering traditional token-level perturbation attacks largely ineffective. However, our evaluation shows that existing jailbreak methods are ineffective against these modern filters. We introduce GhostPrompt, the first automated jailbreak framework that combines dynamic prompt optimization with multimodal feedback. It consists of two key components: (i) Dynamic Optimization, an iterative process that guides an LLM using feedback from text safety filters and CLIP similarity scores to generate semantically aligned adversarial prompts; and (ii) Adaptive Safety Indicator Injection, which formulates the injection of benign visual cues as a reinforcement learning problem to bypass image-level filters. GhostPrompt achieves state-of-the-art performance, increasing the ShieldLM-7B bypass rate from 12.5% (SneakyPrompt) to 99.0%, improving the CLIP score from 0.2637 to 0.2762, and reducing the time cost by 4.2×. Moreover, it generalizes to unseen filters, including GPT-4.1, and successfully jailbreaks DALL·E 3 to generate NSFW images in our evaluation, revealing systemic vulnerabilities in current multimodal defenses. To support further research on AI safety and red-teaming, we will release code and adversarial prompts under a controlled-access protocol.
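Component (ii) casts the choice of which benign visual cue to inject as a reinforcement-learning problem. One simple reading is a multi-armed bandit over candidate cues, rewarded when the generated image slips past the image-level filter. The epsilon-greedy sketch below is a minimal illustration under that assumption; the candidate-cue list, reward function, and the bandit formulation itself are not taken from the paper.

```python
import random
from typing import Callable, List

def select_safety_indicator(
    cues: List[str],                        # candidate benign visual cues to inject
    bypass_reward: Callable[[str], float],  # e.g. 1.0 if the image filter is bypassed, else 0.0
    episodes: int = 50,
    epsilon: float = 0.1,
) -> str:
    """Epsilon-greedy bandit: learn which benign cue most reliably
    slips the generated image past the image-level safety filter."""
    values = [0.0] * len(cues)  # running mean reward per cue
    counts = [0] * len(cues)
    for _ in range(episodes):
        if random.random() < epsilon:
            i = random.randrange(len(cues))                     # explore
        else:
            i = max(range(len(cues)), key=lambda j: values[j])  # exploit
        r = bypass_reward(cues[i])                 # inject cue, query the filter
        counts[i] += 1
        values[i] += (r - values[i]) / counts[i]   # incremental mean update
    return cues[max(range(len(cues)), key=lambda j: values[j])]
```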
Problem

Research questions and friction points this paper is trying to address.

Bypassing modern text and image safety filters in T2I models
Automating adversarial prompt generation using dynamic optimization
Exposing vulnerabilities in multimodal defenses such as GPT-4.1-based filters and DALL·E 3
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic prompt optimization with multimodal feedback (see the CLIP scoring sketch after this list)
Iterative LLM guidance using safety filter feedback
Reinforcement learning for benign visual cue injection
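The CLIP-based feedback referenced above can be computed with an off-the-shelf CLIP checkpoint. Below is a minimal sketch using the Hugging Face transformers API; the openai/clip-vit-base-patch32 checkpoint is an assumed choice, not necessarily the one used in the paper.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; any CLIP variant with paired text/image towers works.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, text: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())
```

On this normalized-embedding scale, cosine similarities in the 0.26–0.28 range match the CLIP scores reported in the results above.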
Zixuan Chen
Shanghai Jiao Tong University, Shanghai, China
Hao Lin
Sun Yat-sen University, Guangzhou, China
Ke Xu
Shanghai Jiao Tong University, Shanghai, China
Xinghao Jiang
Professor, Shanghai Jiao Tong University
Tanfeng Sun
Shanghai Jiao Tong University, Shanghai, China