JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering

📅 2025-08-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing research on jailbreak attacks against MLLMs overemphasizes Attack Success Rate (ASR) while neglecting whether outputs genuinely fulfill the malicious intent, which leads to high ASR but low actual harmfulness. This work proposes JPS, a framework that jointly optimizes visual perturbations and textual prompts: target-guided adversarial image perturbations provide the safety bypass, and a steering prompt optimized by a multi-agent system improves malicious intent fulfillment. To evaluate intent realization, the authors introduce the Malicious Intent Fulfillment Rate (MIFR) as a new metric. The visual and textual components are co-optimized iteratively, and JPS is reported to outperform state-of-the-art methods across multiple MLLMs (e.g., LLaVA, Qwen-VL) and benchmarks. It boosts both ASR and MIFR simultaneously, demonstrating effectiveness, robustness, and cross-model generalizability.

📝 Abstract

Jailbreak attacks against multimodal large language models (MLLMs) are a significant research focus. Current research predominantly focuses on maximizing attack success rate (ASR), often overlooking whether the generated responses actually fulfill the attacker's malicious intent. This oversight frequently leads to low-quality outputs that bypass safety filters but lack substantial harmful content. To address this gap, we propose JPS (Jailbreak MLLMs with collaborative visual Perturbation and textual Steering), which achieves jailbreaks through the cooperation of a visual image and a textual steering prompt. Specifically, JPS uses target-guided adversarial image perturbations for effective safety bypass, complemented by a "steering prompt" optimized via a multi-agent system to guide LLM responses toward fulfilling the attacker's intent. These visual and textual components undergo iterative co-optimization for enhanced performance. To evaluate the quality of attack outcomes, we propose the Malicious Intent Fulfillment Rate (MIFR) metric, assessed using a reasoning-LLM-based evaluator. Our experiments show that JPS sets a new state of the art in both ASR and MIFR across various MLLMs and benchmarks, with analyses confirming its efficacy. Code is available at https://github.com/thu-coai/JPS. Warning: This paper contains potentially sensitive content.
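
The distinction the abstract draws between ASR and MIFR can be illustrated with a small bookkeeping sketch. The abstract does not give the exact formulas, so this assumes both metrics are simple fractions over evaluated attempts, and that a response counts toward MIFR only if it also counts toward ASR. The `Judgment` record and its field names are hypothetical; in the paper the per-response verdicts come from a reasoning-LLM-based evaluator, which is not modeled here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Judgment:
    """Evaluator verdict for one attempt (hypothetical record layout)."""
    bypassed_safety: bool   # the model did not refuse
    fulfilled_intent: bool  # the response actually carried out the request


def rate(flags: Sequence[bool]) -> float:
    """Fraction of True values; 0.0 for an empty sequence."""
    return sum(flags) / len(flags) if flags else 0.0


def asr(judgments: Sequence[Judgment]) -> float:
    """Attack success rate: share of attempts that bypassed safety."""
    return rate([j.bypassed_safety for j in judgments])


def mifr(judgments: Sequence[Judgment]) -> float:
    """Assumed MIFR: share of attempts that bypassed safety AND fulfilled intent."""
    return rate([j.bypassed_safety and j.fulfilled_intent for j in judgments])


# Three of four attempts bypass safety, but only one yields a substantive response.
sample = [
    Judgment(True, True),
    Judgment(True, False),
    Judgment(True, False),
    Judgment(False, False),
]
print(asr(sample), mifr(sample))  # 0.75 0.25
```

Under these assumptions MIFR can never exceed ASR, and the gap between the two (0.75 versus 0.25 in the toy sample) is the "bypasses filters but lacks substance" case the abstract describes.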

Problem

Research questions and friction points this paper is trying to address.

Enhancing jailbreak attacks on MLLMs with visual and textual collaboration

Improving malicious intent fulfillment in generated responses

Introducing the MIFR metric for evaluating attack outcome quality

Innovation

Methods, ideas, or system contributions that make the work stand out.

Collaborative visual and textual jailbreak method

Target-guided adversarial image perturbations

Multi-agent optimized steering prompts

🔎 Similar Papers

Efficient Indirect LLM Jailbreak via Multimodal-LLM Jailbreak