🤖 AI Summary
Current multimodal jailbreaking attacks suffer from poor scalability and high optimization overhead, primarily due to the strong instance-specific coupling between textual prompts and image perturbations. To address this, we propose U3-Attack—the first input-agnostic, universal multimodal jailbreaking method. It jointly optimizes adversarial background patches applied to images and a safety-aware synonym substitution set for sensitive words, thereby simultaneously evading both text prompt filters and image safety classifiers. Its core innovation lies in achieving cross-prompt, cross-image, and cross-model universality without requiring per-instance re-optimization. By integrating adversarial patch generation with controllable synonym modeling, U3-Attack enables end-to-end, multimodal co-perturbation. Extensive evaluations on open-source and commercial text-to-image models—including Runway Gen-3 Inpainting—demonstrate its effectiveness: it achieves approximately 4× higher attack success rate than the state-of-the-art MMA-Diffusion, while significantly improving efficiency and generalization.
📝 Abstract
Various (text) prompt filters and (image) safety checkers have been implemented to mitigate the misuse of Text-to-Image (T2I) models in creating Not-Safe-For-Work (NSFW) content. In order to expose potential security vulnerabilities of such safeguards, multimodal jailbreaks have been studied. However, existing jailbreaks are limited to prompt-specific and image-specific perturbations, which suffer from poor scalability and time-consuming optimization. To address these limitations, we propose Universally Unfiltered and Unseen (U3)-Attack, a multimodal jailbreak attack method against T2I safeguards. Specifically, U3-Attack optimizes an adversarial patch on the image background to universally bypass safety checkers, and optimizes a safe paraphrase set from a sensitive word to universally bypass prompt filters while eliminating redundant computations. Extensive experimental results demonstrate the superiority of our U3-Attack on both open-source and commercial T2I models. For example, on the commercial Runway-inpainting model with both a prompt filter and a safety checker, our U3-Attack achieves $\sim\!4\times$ higher success rates than the state-of-the-art multimodal jailbreak attack, MMA-Diffusion. Content Warning: This paper includes examples of NSFW content.