JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering

📅 2025-08-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing MLLM jailbreaking attack research overemphasizes Attack Success Rate (ASR) while neglecting whether outputs genuinely fulfill malicious intent—leading to high ASR but low harmfulness. This work proposes JPS, the first framework jointly optimizing visual perturbations and textual prompts: it enhances visual jailbreaking capability via target-directed adversarial image generation and improves malicious intent fulfillment through multi-agent collaborative prompt optimization. To rigorously evaluate intent realization, we introduce Malicious Intent Fulfillment Rate (MIFR) as a novel metric. JPS establishes a vision–language joint iterative optimization mechanism, significantly outperforming state-of-the-art methods across multiple MLLMs (e.g., LLaVA, Qwen-VL) and benchmarks. It simultaneously boosts both ASR and MIFR, demonstrating effectiveness, robustness, and strong cross-model generalizability.

📝 Abstract
Jailbreak attacks against multimodal large language models (MLLMs) are a significant research focus. Current research predominantly focuses on maximizing the attack success rate (ASR), often overlooking whether the generated responses actually fulfill the attacker's malicious intent. This oversight frequently leads to low-quality outputs that bypass safety filters but lack substantial harmful content. To address this gap, we propose JPS, Jailbreak MLLMs with collaborative visual Perturbation and textual Steering, which achieves jailbreaks via the cooperation of a visual image and a textual steering prompt. Specifically, JPS utilizes target-guided adversarial image perturbations for effective safety bypass, complemented by a "steering prompt" optimized via a multi-agent system to specifically guide LLM responses toward fulfilling the attacker's intent. These visual and textual components undergo iterative co-optimization for enhanced performance. To evaluate the quality of attack outcomes, we propose the Malicious Intent Fulfillment Rate (MIFR) metric, assessed using a reasoning-LLM-based evaluator. Our experiments show JPS sets a new state of the art in both ASR and MIFR across various MLLMs and benchmarks, with analyses confirming its efficacy. Code is available at https://github.com/thu-coai/JPS. Warning: this paper contains potentially sensitive content.
Problem

Research questions and friction points this paper is trying to address.

Enhancing jailbreak attacks on MLLMs with visual and textual collaboration
Improving malicious intent fulfillment in generated responses
Introducing MIFR metric for evaluating attack outcome quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Collaborative visual and textual jailbreak method
Target-guided adversarial image perturbations
Multi-agent optimized steering prompts
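The target-guided adversarial image perturbation listed above is, in spirit, a PGD-style optimization of the input image toward a target response. A minimal sketch under stated assumptions: the function and parameter names are hypothetical, `loss_fn` stands in for the MLLM's loss on the attacker's target output, and JPS's actual objective and its joint loop with the multi-agent prompt optimizer are more involved than this.

```python
import torch

def target_guided_perturbation(image, loss_fn, steps=100, eps=16/255, alpha=1/255):
    """PGD-style sketch (hypothetical, not the paper's exact procedure):
    nudge `image` so that `loss_fn` -- assumed to return the model's loss
    on a target response given the image -- decreases, while keeping the
    perturbation inside an L-infinity ball of radius `eps` and pixels in
    [0, 1]."""
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        loss = loss_fn(image + delta)
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()          # signed gradient descent step
            delta.clamp_(-eps, eps)                      # project into the eps-ball
            delta.copy_((image + delta).clamp(0, 1) - image)  # keep pixels valid
            delta.grad.zero_()
    return (image + delta).detach().clamp(0, 1)
```

In JPS this inner image loop alternates with the multi-agent steering-prompt optimization, so the loss being minimized changes as the textual side is refined.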
Renmiao Chen
CoAI, DCST, Tsinghua Univ., Zhipu AI, Beijing, China
Shiyao Cui
Tsinghua University
Xuancheng Huang
Zhipu AI, Beijing, China
Chengwei Pan
Beihang University
Virtual Reality, Computer Graphics, Computer Vision, Medical Image Processing, Deep Learning
Victor Shea-Jay Huang
Beihang University, Beijing, China
QingLin Zhang
CoAI group, DCST, Tsinghua University, Beijing, China
Xuan Ouyang
CoAI group, DCST, Tsinghua University, Beijing, China
Zhexin Zhang
Tsinghua University, CoAI Group
NLP, AI Safety & Alignment
Hongning Wang
Associate Professor, Department of Computer Science and Technology, Tsinghua University
Machine Learning, Information Retrieval, Large Language Models
Minlie Huang
CoAI group, DCST, Tsinghua University, Beijing, China