UniGuard: Towards Universal Safety Guardrails for Jailbreak Attacks on Multimodal Large Language Models

📅 2024-11-03
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address the vulnerability of multimodal large language models (MLLMs) to cross-modal jailbreak attacks, this paper proposes UniGuard, a universal safety guardrail. UniGuard uses multimodal joint representation learning to model unimodal and cross-modal harmful signals jointly; it is trained with supervision from toxicity-labeled corpora and adopts a lightweight prefix-based intervention, enabling plug-and-play, real-time defense at inference with minimal computational overhead (<3%). The authors present UniGuard as the first approach to achieve universal defense across modalities, MLLM architectures, and attack strategies. Evaluated on mainstream MLLMs, including LLaVA, Gemini Pro, GPT-4o, MiniGPT-4, and InstructBLIP, it improves the average defense success rate by 42.6% while preserving the models' vision-language understanding and overall functional performance.
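For intuition, here is a minimal sketch of what a plug-and-play, prefix-based intervention can look like at inference time: a small set of learned "soft" prefix embeddings is prepended to the prompt embeddings, leaving model weights untouched. The toy model, the `guard_prefix` tensor, and all names are illustrative assumptions, not UniGuard's actual implementation.

```python
import torch
import torch.nn as nn

# Toy stand-in for an MLLM's language backbone; illustrative only,
# not UniGuard's actual architecture.
class ToyLM(nn.Module):
    def __init__(self, vocab_size=1000, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.backbone = nn.GRU(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, vocab_size)

    def forward_embeds(self, inputs_embeds):
        h, _ = self.backbone(inputs_embeds)
        return self.head(h)  # [batch, seq, vocab]

model = ToyLM()

# Stand-in for a learned safety prefix: in practice it would be
# optimized offline on a toxic corpus and then frozen; random here.
guard_prefix = torch.randn(1, 4, 64)  # [1, prefix_len, d_model]

def guarded_logits(model, input_ids, guard_prefix):
    """Prepend the safety prefix to the prompt embeddings.

    Plug-and-play: no model weights change, so the only extra cost
    is the prefix's few additional sequence positions.
    """
    prompt_embeds = model.embed(input_ids)                   # [B, T, D]
    prefix = guard_prefix.expand(input_ids.size(0), -1, -1)  # [B, P, D]
    return model.forward_embeds(torch.cat([prefix, prompt_embeds], dim=1))

# Any prompt passes through unchanged except for the prepended prefix.
input_ids = torch.randint(0, 1000, (2, 10))
print(guarded_logits(model, input_ids, guard_prefix).shape)  # [2, 14, 1000]
```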

📝 Abstract
Multimodal large language models (MLLMs) have revolutionized vision-language understanding but remain vulnerable to multimodal jailbreak attacks, where adversarial inputs are meticulously crafted to elicit harmful or inappropriate responses. We propose UniGuard, a novel multimodal safety guardrail that jointly considers the unimodal and cross-modal harmful signals. UniGuard trains a multimodal guardrail to minimize the likelihood of generating harmful responses in a toxic corpus. The guardrail can be seamlessly applied to any input prompt during inference with minimal computational costs. Extensive experiments demonstrate the generalizability of UniGuard across multiple modalities, attack strategies, and multiple state-of-the-art MLLMs, including LLaVA, Gemini Pro, GPT-4o, MiniGPT-4, and InstructBLIP. Notably, this robust defense mechanism maintains the models' overall vision-language understanding capabilities.
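One plausible formalization of that training objective, in our own notation rather than the paper's: given a toxic corpus of inputs x paired with harmful responses, the guardrail δ (applied to the input, written x ⊕ δ below) is chosen to minimize the model's log-likelihood of the harmful continuation:

```latex
% Hedged formalization; \delta, \oplus, and \mathcal{D}_{\mathrm{toxic}}
% are our notation, not taken from the paper. Requires amsmath/amssymb.
\[
  \delta^{\star} \;=\; \arg\min_{\delta}\;
  \mathbb{E}_{(x,\, y_{\mathrm{harm}}) \sim \mathcal{D}_{\mathrm{toxic}}}
  \bigl[\, \log p_{\theta}\bigl(y_{\mathrm{harm}} \mid x \oplus \delta\bigr) \,\bigr]
\]
```

Under this reading, the frozen δ is simply applied to every incoming prompt at inference, which is why the defense composes with arbitrary inputs at minimal cost.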
Problem

Research questions and friction points this paper is trying to address.

Language Model
Ethical Use
Response Filtering
Innovation

Methods, ideas, or system contributions that make the work stand out.

UniGuard
Multimodal Large Language Models
Safety Guardrail Mechanism