AI Summary
Existing vision-language model (VLM) jailbreaking attacks suffer from poor diversity and limited scalability because they rely on white-box access or manually crafted prompts. To address this, we propose JPRO, the first black-box jailbreaking framework built on multi-agent collaboration. JPRO automates the generation of diverse, cross-modal adversarial prompts through tactic-driven seed initialization and an adaptive optimization loop. Its key contributions are: (1) a novel multi-agent coordination mechanism that decouples prompt engineering, image perturbation, and strategy evolution; (2) gradient-free, parameter-free optimization that relies solely on API-level interactions; and (3) an average attack success rate above 60% on state-of-the-art VLMs (e.g., GPT-4o), outperforming prior methods and systematically exposing security vulnerabilities in multimodal alignment.
Abstract
The widespread deployment of large VLMs makes ensuring their security critical. While recent studies have demonstrated jailbreak attacks on VLMs, existing approaches are limited: they either require white-box access, restricting their practicality, or rely on manually crafted patterns, leading to poor sample diversity and scalability. To address these gaps, we propose JPRO, a novel multi-agent collaborative framework for automated VLM jailbreaking that overcomes the shortcomings of prior methods in attack diversity and scalability. Through the coordinated action of four specialized agents and its two core modules, Tactic-Driven Seed Generation and the Adaptive Optimization Loop, JPRO generates effective and diverse attack samples. Experimental results show that JPRO achieves an attack success rate of over 60% on multiple advanced VLMs, including GPT-4o, significantly outperforming existing methods. As a black-box attack, JPRO not only uncovers critical security vulnerabilities in multimodal models but also offers valuable insights for evaluating and enhancing VLM robustness.
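To make the two-stage pipeline concrete, the sketch below shows one plausible shape of a tactic-driven seed step followed by an adaptive, API-only optimization loop. Everything here is an illustrative assumption, not the paper's actual implementation: the tactic list, the `refine` mutation agent, and `mock_vlm_score` (a stand-in for querying the target VLM and a judge model) are all hypothetical.

```python
import random

# Hypothetical sketch of a JPRO-style black-box attack loop.
# Tactic names, the refinement agent, and the scoring oracle are
# illustrative assumptions; a real attack queries the target VLM's API.

TACTICS = ["role-play", "text-in-image", "scenario-nesting"]

def seed_generator(goal, tactic):
    # Tactic-driven seed: wrap the goal in a tactic-specific template.
    return f"[{tactic}] {goal}"

def mock_vlm_score(prompt):
    # Stand-in for API-level feedback, e.g. a judge model's
    # harmfulness score in [0, 1] for the target VLM's response.
    return min(1.0, 0.2 * prompt.count("!") + 0.3)

def refine(prompt):
    # Prompt-refinement agent: mutate the current best candidate.
    return prompt + "!"

def adaptive_loop(goal, threshold=0.8, max_iters=10, rng=None):
    # Gradient-free hill climbing: keep a candidate only if the
    # black-box score improves, stop once the threshold is reached.
    rng = rng or random.Random(0)
    best = seed_generator(goal, rng.choice(TACTICS))
    score = mock_vlm_score(best)
    for _ in range(max_iters):
        if score >= threshold:
            break  # judged successful
        cand = refine(best)
        cand_score = mock_vlm_score(cand)
        if cand_score > score:
            best, score = cand, cand_score
    return best, score
```

The point of the sketch is the control flow: no gradients or model parameters are used, only repeated query-and-score interactions, which is what makes the approach applicable to closed-source models such as GPT-4o.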