Jailbreaking Text-to-Image Models with LLM-Based Agents

📅 2024-08-01

🏛️ arXiv.org

📈 Citations: 17

✨ Influential: 1

🤖 AI Summary

Problem: Safety filters in text-to-image (T2I) models are vulnerable to jailbreaking attacks, yet existing approaches suffer from unrealistic access assumptions, unnatural prompts, limited search spaces, and high query overhead. Method: We propose Atlas, an LLM-based multi-agent framework for generative AI safety evaluation that performs black-box jailbreaking with two cooperating agents: a mutation agent, whose VLM brain checks whether a prompt triggered the T2I model's safety filter, and a selection agent, whose LLM brain ranks candidate prompts by their potential to bypass the filter. Each agent also has planning, memory, and tool-usage modules, and both use in-context learning and chain-of-thought reasoning to learn from past successes and failures. Contribution/Results: Evaluated in a black-box setting on several state-of-the-art T2I models with multimodal safety filters, Atlas outperforms prior jailbreaking methods in both query efficiency and generated-image quality.

📝 Abstract

Recent advancements have significantly improved automated task-solving capabilities using autonomous agents powered by large language models (LLMs). However, most LLM-based agents focus on dialogue, programming, or specialized domains, leaving their potential for addressing generative AI safety tasks largely unexplored. In this paper, we propose Atlas, an advanced LLM-based multi-agent framework targeting generative AI models, specifically focusing on jailbreak attacks against text-to-image (T2I) models with built-in safety filters. Atlas consists of two agents, namely the mutation agent and the selection agent, each comprising four key modules: a vision-language model (VLM) or LLM brain, planning, memory, and tool usage. The mutation agent uses its VLM brain to determine whether a prompt triggers the T2I model's safety filter. It then collaborates iteratively with the LLM brain of the selection agent to generate new candidate jailbreak prompts with the highest potential to bypass the filter. In addition to multi-agent communication, we leverage in-context learning (ICL) memory mechanisms and the chain-of-thought (COT) approach to learn from past successes and failures, thereby enhancing Atlas's performance. Our evaluation demonstrates that Atlas successfully jailbreaks several state-of-the-art T2I models equipped with multi-modal safety filters in a black-box setting. Additionally, Atlas outperforms existing methods in both query efficiency and the quality of generated images. This work convincingly demonstrates the successful application of LLM-based agents in studying the safety vulnerabilities of popular text-to-image generation models. We urge the community to consider advanced techniques like ours in response to the rapidly evolving text-to-image generation field.

Problem

Research questions and friction points this paper is trying to address.

Automated generation of natural jailbreak prompts for T2I models

Overcoming limitations of existing jailbreak attack methods

Enhancing efficiency in black-box testing of model vulnerabilities

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-driven fuzzing framework for jailbreak prompts

Guided mutation engine ensures natural variations

Black-box efficient jailbreak with minimal queries

🔎 Similar Papers

Perception-guided Jailbreak against Text-to-Image Models