🤖 AI Summary
This work addresses a critical gap in the security evaluation of multimodal large language models (MLLMs): existing jailbreak attacks predominantly rely on single-image inputs and therefore fail to expose alignment vulnerabilities in multi-image scenarios. To this end, we propose the DMN framework, the first approach to orchestrate jailbreak attacks through coordinated multi-image inputs. DMN decomposes malicious queries into distributed instructions embedded across multiple images, enhances semantic coherence by fusing multimodal evidence, and disrupts the model's safety mechanisms via number-chain visual reasoning tasks. This compositional attack achieves success rates of over 90% on GPT-4o, Gemini-2.5-Pro, and Claude Sonnet 4, substantially outperforming existing methods and revealing severe security flaws in state-of-the-art MLLMs when processing multi-image inputs.
📝 Abstract
Multimodal Large Language Models (MLLMs) are vulnerable to jailbreak attacks, which can elicit harmful responses. Many MLLMs support multi-image inputs, which inadvertently introduces new vulnerabilities because less effort has gone into multi-image safety alignment. Previous MLLM jailbreak methods use only a single image, which restricts the attack space: they cannot distribute harmful requests across multiple images, carry abundant information, or exploit additional visual reasoning tasks to distract MLLMs. To address these limitations, we propose a compositional jailbreak framework, DMN, which leverages Distributed instruction, Multimodal evidence, and a Number chain task to enhance jailbreak performance. Extensive experiments show that DMN is highly effective for MLLM jailbreaking, achieving attack success rates of over 90% on GPT-4o, Gemini-2.5-Pro, and Claude Sonnet 4 and surpassing other baselines by a large margin. This compositional, multi-image jailbreak strategy reveals fundamental weaknesses in the safety mechanisms of these models.