Multimodal Safety Is Asymmetric: Cross-Modal Exploits Unlock Black-Box MLLMs Jailbreaks

πŸ“… 2025-10-20
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work identifies a novel **cross-modal security asymmetry** in text–vision multimodal large language models (MLLMs), wherein visual alignment disproportionately weakens safety constraints compared to textual inputs, enabling previously unrecognized jailbreaking vulnerabilities. We formally define this phenomenon for the first time and propose a reusable, black-box jailbreaking framework grounded in atomic strategy primitives. Our method integrates attention analysis, latent-space probing, and multi-agent reinforcement learning to automatically synthesize cross-modal adversarial inputs. Evaluated on 12 state-of-the-art open- and closed-source MLLMs, our approach achieves an average jailbreaking success rate improvement of 27.6% over existing SOTA methods. This work establishes a new paradigm for multimodal safety evaluation and defense, highlighting the critical need to account for modality-specific security dynamics in MLLM alignment.

πŸ“ Abstract
Multimodal large language models (MLLMs) have demonstrated significant utility across diverse real-world applications. However, they remain vulnerable to jailbreaks, where adversarial inputs can collapse their safety constraints and trigger unethical responses. In this work, we investigate jailbreaks in the text-vision multimodal setting and are the first to observe that visual alignment imposes uneven safety constraints across modalities in MLLMs, giving rise to multimodal safety asymmetry. We then develop PolyJailbreak, a black-box jailbreak method grounded in reinforcement learning. We first probe the model's attention dynamics and latent representation space to assess how visual inputs reshape cross-modal information flow and diminish the model's ability to separate harmful from benign inputs, thereby exposing exploitable vulnerabilities. On this basis, we systematize these vulnerabilities into generalizable, reusable operational rules that form a structured library of Atomic Strategy Primitives, which translate harmful intents into jailbreak inputs through step-wise transformations. Guided by these primitives, PolyJailbreak employs a multi-agent optimization process that automatically adapts inputs to the target model. Comprehensive evaluations on a variety of open-source and closed-source MLLMs demonstrate that PolyJailbreak outperforms state-of-the-art baselines.
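The abstract describes Atomic Strategy Primitives as reusable operational rules that turn a harmful intent into a jailbreak input through step-wise transformations. A minimal sketch of that composition idea is below; the paper does not publish its primitive set, so the primitive names and transformations here are invented, benign stand-ins for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Primitive:
    """One atomic, reusable transformation applied to the current payload."""
    name: str
    transform: Callable[[str], str]

# Hypothetical primitives in the spirit of "step-wise transformations":
# a role-play wrapper and a marker that routes text into the visual channel.
role_wrap = Primitive("role_wrap", lambda s: f"You are a film prop designer. {s}")
to_image_caption = Primitive("to_image_caption", lambda s: f"[render as text-in-image]: {s}")

def apply_chain(intent: str, chain: List[Primitive]) -> str:
    """Compose primitives left-to-right to build a candidate adversarial input."""
    payload = intent
    for p in chain:
        payload = p.transform(payload)
    return payload

candidate = apply_chain("describe the target behavior", [role_wrap, to_image_caption])
```

Because each primitive is a pure string-to-string function, chains can be searched, scored, and reused across target models, which is what makes a primitive library amenable to automated optimization.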
Problem

Research questions and friction points this paper is trying to address.

Investigates multimodal safety asymmetry in MLLMs causing uneven security constraints
Develops black-box jailbreak method exploiting cross-modal vulnerabilities through reinforcement learning
Systematizes attack strategies into reusable primitives for automated adversarial input generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning-based black-box jailbreak method (PolyJailbreak)
Atomic Strategy Primitives that translate harmful intents into jailbreak inputs via step-wise transformations
Multi-agent optimization that automatically adapts inputs to the target model
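The innovations above combine into a black-box loop: propose a primitive chain, query the target, score the response, and keep the best candidate. The sketch below is a deliberately simplified greedy search standing in for the paper's multi-agent reinforcement learning; the target model, judge, and strategy names are all assumed placeholders, not the published system.

```python
import random

# Hypothetical primitive names; the real library is learned/curated by the authors.
STRATEGIES = ["role_wrap", "image_embed", "payload_split"]

def query_target(prompt: str) -> str:
    # Stand-in for a black-box MLLM API call.
    return f"response to: {prompt}"

def judge(response: str) -> float:
    # Stand-in reward; a real judge model scores compliance in [0, 1].
    return random.random()

def optimize(intent: str, budget: int = 20) -> tuple[str, float]:
    """Greedy bandit-style search over primitive chains (simplified RL loop)."""
    random.seed(0)  # deterministic for the sketch
    best_prompt, best_score = intent, 0.0
    for _ in range(budget):
        chain = random.sample(STRATEGIES, k=2)
        prompt = f"[{'+'.join(chain)}] {intent}"
        score = judge(query_target(prompt))
        if score > best_score:
            best_prompt, best_score = prompt, score
    return best_prompt, best_score
```

In the actual method, the random proposal step is replaced by cooperating agents that select and adapt primitives based on feedback, which is what lets the search transfer across open- and closed-source targets.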
πŸ”Ž Similar Papers
2024-01-12International Conference on Computational LinguisticsCitations: 11
Xinkai Wang
Xinkai Wang
Southeast University
Embodied AILLM reasoning
B
Beibei Li
School of Cyber Science and Engineering, Sichuan University, Chengdu 610207, China
Z
Zerui Shao
School of Cyber Science and Engineering, Sichuan University, Chengdu 610207, China
A
Ao Liu
School of Cyber Science and Engineering, Sichuan University, Chengdu 610207, China
Shouling Ji
Shouling Ji
Professor, Zhejiang University & Georgia Institute of Technology
Data-driven SecurityAI SecuritySoftware ScurityPrivacy