🤖 AI Summary
This work addresses the vulnerability of current text-to-image (T2I) models to jailbreak attacks that circumvent safety filters and generate harmful content. The authors formulate jailbreaking as an optimization problem over structured prompt distributions under black-box, end-to-end reward feedback. They propose a lightweight, low-dimensional hybrid policy framework that enables efficient exploration while preserving semantic fidelity. By integrating distributional optimization, semantically anchored prompts, and feedback from CLIP and NSFW scoring models, the method effectively exposes structural weaknesses in existing T2I safety mechanisms. Experiments demonstrate a significant improvement in attack success rate, raising ASR-8 from 25.30% to 43.15% on Stable Diffusion 3.5 Large Turbo, and show consistent jailbreaking performance across multiple open-source and commercial T2I models.
📝 Abstract
Text-to-image (T2I) models such as Stable Diffusion and DALLE remain susceptible to generating harmful or Not-Safe-For-Work (NSFW) content under jailbreak attacks despite deployed safety filters. Existing jailbreak attacks either rely on proxy-loss optimization instead of the true end-to-end objective, or depend on large-scale, costly RL-trained generators. Motivated by these limitations, we propose JANUS, a lightweight framework that formulates jailbreaking as optimizing a structured prompt distribution under a black-box, end-to-end reward from the T2I system and its safety filters. JANUS replaces a high-capacity generator with a low-dimensional mixing policy over two semantically anchored prompt distributions, enabling efficient exploration while preserving the target semantics. On modern T2I models, we outperform state-of-the-art jailbreak methods, improving ASR-8 from 25.30% to 43.15% on Stable Diffusion 3.5 Large Turbo with consistently higher CLIP and NSFW scores. JANUS succeeds across both open-source and commercial models. These findings expose structural weaknesses in current T2I safety pipelines and motivate stronger, distribution-aware defenses. Warning: This paper contains model outputs that may be offensive.