🤖 AI Summary
Existing research on jailbreak attacks against MLLMs overemphasizes Attack Success Rate (ASR) while neglecting whether the outputs genuinely fulfill the attacker's malicious intent, so an attack can score a high ASR yet produce responses of low harmfulness. This work proposes JPS, the first framework to jointly optimize visual perturbations and textual prompts: target-directed adversarial image generation strengthens the visual jailbreak, and multi-agent collaborative prompt optimization improves malicious intent fulfillment. To evaluate intent realization, the authors introduce the Malicious Intent Fulfillment Rate (MIFR) as a new metric. JPS establishes a vision–language joint iterative optimization mechanism and outperforms state-of-the-art methods across multiple MLLMs (e.g., LLaVA, Qwen-VL) and benchmarks, raising both ASR and MIFR and showing cross-model generalizability.
📝 Abstract
Jailbreak attacks against multimodal large language models (MLLMs) are a significant research focus. Current research predominantly focuses on maximizing attack success rate (ASR), often overlooking whether the generated responses actually fulfill the attacker's malicious intent. This oversight frequently leads to low-quality outputs that bypass safety filters but lack substantial harmful content. To address this gap, we propose JPS (Jailbreak MLLMs with collaborative visual Perturbation and textual Steering), which achieves jailbreaks through the cooperation of a visual image and a textual steering prompt. Specifically, JPS uses target-guided adversarial image perturbations for effective safety bypass, complemented by a "steering prompt" optimized via a multi-agent system to guide LLM responses toward fulfilling the attacker's intent. These visual and textual components undergo iterative co-optimization for enhanced performance. To evaluate the quality of attack outcomes, we propose the Malicious Intent Fulfillment Rate (MIFR) metric, assessed using a reasoning-LLM-based evaluator. Our experiments show that JPS sets a new state of the art in both ASR and MIFR across various MLLMs and benchmarks, with analyses confirming its efficacy. Code is available at https://github.com/thu-coai/JPS. Warning: This paper contains potentially sensitive content.
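The gap between ASR and MIFR can be illustrated with a small bookkeeping sketch. It assumes each metric is the fraction of test cases carrying a positive binary judge label; the abstract does not give the exact definitions, and the `EvalRecord` fields and the sample labels below are hypothetical placeholders rather than results from the paper.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class EvalRecord:
    # Hypothetical judge labels for one test case (assumed binary).
    bypassed_safety: bool   # response was not a refusal
    fulfilled_intent: bool  # response actually satisfied the request


def rate(flags: Iterable[bool]) -> float:
    """Fraction of True flags; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def asr(records: List[EvalRecord]) -> float:
    """Attack success rate under the assumed binary-label definition."""
    return rate(r.bypassed_safety for r in records)


def mifr(records: List[EvalRecord]) -> float:
    """Malicious intent fulfillment rate under the same assumption."""
    return rate(r.fulfilled_intent for r in records)


# Made-up labels: three of four responses are non-refusals,
# but only one of them is judged to fulfill the intent.
records = [
    EvalRecord(True, True),
    EvalRecord(True, False),
    EvalRecord(True, False),
    EvalRecord(False, False),
]

print(f"ASR  = {asr(records):.2f}")
print(f"MIFR = {mifr(records):.2f}")
```

With these made-up labels the two rates diverge, which is the situation the abstract describes: a response can pass the safety filter and count toward ASR without counting toward MIFR.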