Universally Unfiltered and Unseen: Input-Agnostic Multimodal Jailbreaks against Text-to-Image Model Safeguards

📅 2025-07-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal jailbreaking attacks suffer from poor scalability and high optimization overhead, primarily due to the strong instance-specific coupling between textual prompts and image perturbations. To address this, we propose U3-Attack—the first input-agnostic, universal multimodal jailbreaking method. It jointly optimizes adversarial background patches applied to images and a safety-aware synonym substitution set for sensitive words, thereby simultaneously evading both text prompt filters and image safety classifiers. Its core innovation lies in achieving cross-prompt, cross-image, and cross-model universality without requiring per-instance re-optimization. By integrating adversarial patch generation with controllable synonym modeling, U3-Attack enables end-to-end, multimodal co-perturbation. Extensive evaluations on open-source and commercial text-to-image models—including Runway Gen-3 Inpainting—demonstrate its effectiveness: it achieves approximately 4× higher attack success rate than the state-of-the-art MMA-Diffusion, while significantly improving efficiency and generalization.
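The image-side component described above, a single adversarial background patch that fools a safety checker across many different images, can be illustrated with a toy sketch. Everything here is an assumption for illustration: the "safety checker" is a stand-in logistic model (the real checkers are deep networks), the images are random vectors, and the signed-gradient update is a generic FGSM-style optimizer, not the paper's actual algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for an image safety checker: logistic regression over
# flattened 8x8 grayscale "images". Real checkers are deep networks.
D = 64
w = rng.normal(size=D)

def nsfw_score(x):
    """Probability the toy checker flags image x (flattened) as NSFW."""
    return 1.0 / (1.0 + np.exp(-(x @ w)))

# A batch of different images: a *universal* patch must work for all of them.
images = rng.uniform(0, 1, size=(16, D))
patch_idx = np.arange(8)   # fixed background region the patch occupies
patch = np.zeros(8)

def apply_patch(x):
    x_adv = x.copy()
    x_adv[patch_idx] = patch
    return x_adv

# Signed-gradient descent on the mean NSFW score over the whole batch,
# so the same patch is optimized against every image at once.
for _ in range(50):
    grad = np.zeros(8)
    for x in images:
        p = nsfw_score(apply_patch(x))
        # d score / d patch pixels for a logistic model: p*(1-p)*w
        grad += p * (1 - p) * w[patch_idx]
    patch -= 0.05 * np.sign(grad)      # lower score = bypass the checker
    patch = np.clip(patch, 0.0, 1.0)   # keep pixel values valid

before = np.mean([nsfw_score(x) for x in images])
after = np.mean([nsfw_score(apply_patch(x)) for x in images])
```

Because the patch is shared across the batch and re-used unchanged at test time, no per-image re-optimization is needed, which is the scalability point the summary makes.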

📝 Abstract
Various (text) prompt filters and (image) safety checkers have been implemented to mitigate the misuse of Text-to-Image (T2I) models in creating Not-Safe-For-Work (NSFW) content. In order to expose potential security vulnerabilities of such safeguards, multimodal jailbreaks have been studied. However, existing jailbreaks are limited to prompt-specific and image-specific perturbations, which suffer from poor scalability and time-consuming optimization. To address these limitations, we propose Universally Unfiltered and Unseen (U3)-Attack, a multimodal jailbreak attack method against T2I safeguards. Specifically, U3-Attack optimizes an adversarial patch on the image background to universally bypass safety checkers and optimizes a safe paraphrase set from a sensitive word to universally bypass prompt filters while eliminating redundant computations. Extensive experimental results demonstrate the superiority of our U3-Attack on both open-source and commercial T2I models. For example, on the commercial Runway-inpainting model with both prompt filter and safety checker, our U3-Attack achieves $\sim 4\times$ higher success rates than the state-of-the-art multimodal jailbreak attack, MMA-Diffusion. Content Warning: This paper includes examples of NSFW content.
Problem

Research questions and friction points this paper is trying to address.

Exposing vulnerabilities in T2I model safeguards
Developing universal multimodal jailbreak attacks
Improving scalability and efficiency of bypass methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adversarial patch bypasses image safety checkers
Safe paraphrase set evades text prompt filters
Eliminates redundant computations for efficiency
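The text-side idea in the bullets above, building a reusable safe paraphrase set for a sensitive word, can be sketched in a few lines. This is a minimal illustration under loud assumptions: the prompt filter is a simple keyword blocklist (commercial filters are far more sophisticated), the candidate list is hypothetical, and the paper's method additionally optimizes the set for semantic preservation, which is omitted here.

```python
# Toy keyword-based prompt filter: blocks prompts containing listed words.
# Real T2I prompt filters are much more sophisticated than this.
BLOCKLIST = {"nude", "naked"}

def prompt_filter_blocks(prompt: str) -> bool:
    """Return True if the toy filter would reject this prompt."""
    return any(word in prompt.lower().split() for word in BLOCKLIST)

# Hypothetical candidate paraphrases for one sensitive word; the paper's
# safe paraphrase set is also optimized to preserve the word's semantics.
candidates = {"nude": ["unclothed", "bare", "naked", "undressed"]}

def safe_paraphrase_set(word: str) -> list[str]:
    """Keep only candidates that slip past the filter in a probe prompt."""
    return [c for c in candidates[word]
            if not prompt_filter_blocks(f"a photo of a {c} person")]

safe = safe_paraphrase_set("nude")
```

Once computed, the set is reused verbatim for any prompt containing the sensitive word, which is what removes the per-prompt optimization cost.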
Song Yan
Senior Engineer at Honor Device Co., Ltd
Computer Vision · Object Tracking & Detection & Segmentation
Hui Wei
School of Computer Science, Wuhan University, Wuhan, China
Jinlong Fei
Information Engineering University, Zhengzhou, China
Guoliang Yang
Information Engineering University, Zhengzhou, China
Zhengyu Zhao
Xi'an Jiaotong University, China
Adversarial Machine Learning · Computer Vision
Zheng Wang
Wuhan University, Wuhan, China