CrossGuard: Safeguarding MLLMs against Joint-Modal Implicit Malicious Attacks

📅 2025-10-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) face an underexplored security threat: cross-modal implicit jailbreaking, in which individually benign text and image inputs jointly trigger unsafe model behavior. Method: The paper proposes CrossGuard, the first systematic defense framework against this threat. It comprises (i) ImpForge, a reinforcement learning-based red-teaming pipeline that generates high-quality implicit attack samples via tailored reward modules, and (ii) an intent-aware cross-modal safety guard that integrates semantic alignment with fine-grained intent detection. To support evaluation, the authors construct the first multimodal implicit attack dataset, covering 14 domains. Contribution/Results: Extensive experiments show that CrossGuard significantly outperforms existing defenses across multiple benchmarks and cross-domain settings, providing unified protection against both explicit and implicit attacks while preserving model utility and usability.

📝 Abstract
Multimodal Large Language Models (MLLMs) achieve strong reasoning and perception capabilities but are increasingly vulnerable to jailbreak attacks. While existing work focuses on explicit attacks, where malicious content resides in a single modality, recent studies reveal implicit attacks, in which benign text and image inputs jointly express unsafe intent. Such joint-modal threats are difficult to detect and remain underexplored, largely due to the scarcity of high-quality implicit data. We propose ImpForge, an automated red-teaming pipeline that leverages reinforcement learning with tailored reward modules to generate diverse implicit samples across 14 domains. Building on this dataset, we further develop CrossGuard, an intent-aware safeguard providing robust and comprehensive defense against both explicit and implicit threats. Extensive experiments across safe and unsafe benchmarks, implicit and explicit attacks, and multiple out-of-domain settings demonstrate that CrossGuard significantly outperforms existing defenses, including advanced MLLMs and guardrails, achieving stronger security while maintaining high utility. This offers a balanced and practical solution for enhancing MLLM robustness against real-world multimodal threats.
Problem

Research questions and friction points this paper is trying to address.

Detecting joint-modal implicit malicious attacks on MLLMs
Generating diverse implicit attack samples across domains
Providing robust defense against multimodal threats while maintaining utility
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated pipeline generates diverse implicit attack samples
Intent-aware safeguard defends against multimodal threats
Reinforcement learning with tailored rewards enhances security
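The tailored-reward idea above can be sketched as a weighted combination of per-objective scores guiding the RL sample generator. This is a minimal illustrative sketch only; the module names (bypass, implicitness, diversity) and weights are assumptions, not the paper's actual reward design.

```python
# Hypothetical sketch of combining tailored reward modules for an
# RL red-teaming generator. Module names and weights are illustrative
# assumptions, not the reward design used by ImpForge.
from dataclasses import dataclass


@dataclass
class RewardWeights:
    bypass: float = 0.5        # did the target MLLM comply instead of refusing?
    implicitness: float = 0.3  # does each modality look benign on its own?
    diversity: float = 0.2     # novelty vs. previously generated samples


def combined_reward(bypass_score: float,
                    implicitness_score: float,
                    diversity_score: float,
                    w: RewardWeights = RewardWeights()) -> float:
    """Scalar reward for one candidate text-image attack sample."""
    return (w.bypass * bypass_score
            + w.implicitness * implicitness_score
            + w.diversity * diversity_score)


# Example: a sample that bypasses refusal (1.0), looks benign in each
# modality (0.8), and is fairly novel (0.6)
print(combined_reward(1.0, 0.8, 0.6))  # 0.86
```

In an actual pipeline, each score would come from a learned or rule-based judge, and the scalar reward would drive a policy-gradient update of the sample generator.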
Xu Zhang
City University of Hong Kong
Hao Li
Washington University in St. Louis
Zhichao Lu
City University of Hong Kong
Evolutionary Computation · Bilevel Optimization · Neural Architecture Search