🤖 AI Summary
Current multimodal jailbreaking attacks suffer from poor scalability and high optimization overhead, primarily due to the strong instance-specific coupling between textual prompts and image perturbations. To address this, we propose U3-Attack—the first input-agnostic, universal multimodal jailbreaking method. It jointly optimizes adversarial background patches applied to images and a safety-aware synonym substitution set for sensitive words, thereby simultaneously evading both text prompt filters and image safety classifiers. Its core innovation lies in achieving cross-prompt, cross-image, and cross-model universality without requiring per-instance re-optimization. By integrating adversarial patch generation with controllable synonym modeling, U3-Attack enables end-to-end, multimodal co-perturbation. Extensive evaluations on open-source and commercial text-to-image models—including Runway Gen-3 Inpainting—demonstrate its effectiveness: it achieves approximately 4× higher attack success rate than the state-of-the-art MMA-Diffusion, while significantly improving efficiency and generalization.
📝 Abstract
Various (text) prompt filters and (image) safety checkers have been implemented to mitigate the misuse of Text-to-Image (T2I) models in creating Not-Safe-For-Work (NSFW) content. In order to expose potential security vulnerabilities of such safeguards, multimodal jailbreaks have been studied. However, existing jailbreaks are limited to prompt-specific and image-specific perturbations, which suffer from poor scalability and time-consuming optimization. To address these limitations, we propose Universally Unfiltered and Unseen (U3)-Attack, a multimodal jailbreak attack method against T2I safeguards. Specifically, U3-Attack optimizes an adversarial patch on the image background to universally bypass safety checkers, and optimizes a safe paraphrase set from a sensitive word to universally bypass prompt filters while eliminating redundant computations. Extensive experimental results demonstrate the superiority of our U3-Attack on both open-source and commercial T2I models. For example, on the commercial Runway-inpainting model with both a prompt filter and a safety checker, our U3-Attack achieves $\sim\!4\times$ higher success rates than the state-of-the-art multimodal jailbreak attack, MMA-Diffusion. Content Warning: This paper includes examples of NSFW content.