🤖 AI Summary
Multimodal large language models (MLLMs) exhibit insufficient robustness against jailbreak attacks, and existing defenses fail to withstand sophisticated white-box adversarial attacks. Method: We propose SafeMLLM, a novel defense framework featuring the Contrastive Embedding Attack (CoE-Attack), the first method to formulate a contrastive learning objective in the token embedding space for generating high-quality, differentiable cross-modal adversarial perturbations. SafeMLLM employs end-to-end adversarial training to jointly optimize perturbation generation and model parameters, preserving original task performance while enhancing security. Contribution/Results: Extensive experiments across six state-of-the-art MLLMs and six diverse cross-modal jailbreak attacks demonstrate that SafeMLLM significantly reduces attack success rates, by up to 72.4% on average, while maintaining strong utility and generalization on benign inputs.
📝 Abstract
While multimodal large language models (MLLMs) have achieved remarkable success in recent years, their susceptibility to jailbreak attacks has come to light. In such attacks, adversaries exploit carefully crafted prompts to coerce models into generating harmful or undesirable content. Existing defense mechanisms often rely on external inference steps or safety alignment training, both of which are less effective and impractical when facing sophisticated adversarial perturbations in white-box scenarios. To address these challenges and bolster MLLM robustness, we introduce SafeMLLM, which adopts an adversarial training framework that alternates between an attack step for generating adversarial noise and a model-updating step. At the attack step, SafeMLLM generates adversarial perturbations through a newly proposed contrastive embedding attack (CoE-Attack), which optimizes token embeddings under a contrastive objective. SafeMLLM then updates model parameters to neutralize the perturbation effects while preserving model utility on benign inputs. We evaluate SafeMLLM across six MLLMs and six jailbreak methods spanning multiple modalities. Experimental results show that SafeMLLM effectively defends against diverse attacks, maintaining robust performance and utility.
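The alternating scheme described above (an attack step that optimizes embedding-space perturbations under a contrastive objective, then a model-update step that neutralizes them) can be illustrated with a small NumPy sketch. Everything here is invented for illustration and is not the paper's implementation: a linear map `W` stands in for the MLLM's representation, finite differences stand in for backpropagation, and `t_harm` / `t_refuse` are hypothetical anchor embeddings for harmful and refusal responses.

```python
import numpy as np

# Toy sketch of SafeMLLM-style alternating adversarial training.
# Assumptions: W is a stand-in "model", t_harm / t_refuse are hypothetical
# anchor embeddings; the real method uses gradient-based optimization over
# an MLLM's token embeddings, not finite differences over a linear map.

rng = np.random.default_rng(0)
DIM = 6

def unit(v):
    return v / (np.linalg.norm(v) + 1e-8)

def attack_loss(W, x, t_harm, t_refuse):
    # Contrastive objective: pull the representation of the (perturbed)
    # input toward the harmful anchor, push it away from the refusal anchor.
    z = unit(W @ x)
    return -(z @ unit(t_harm)) + (z @ unit(t_refuse))

def fd_grad(f, v, eps=1e-5):
    # Finite-difference gradient (stands in for autograd); works for
    # vectors (the perturbation) and matrices (the model parameters).
    v = v.copy()
    g = np.zeros_like(v)
    base = f(v)
    for i in range(v.size):
        old = v.flat[i]
        v.flat[i] = old + eps
        g.flat[i] = (f(v) - base) / eps
        v.flat[i] = old
    return g

def coe_attack_step(W, x, t_harm, t_refuse, steps=150, lr=0.1):
    # Attack step: optimize an additive perturbation delta in embedding
    # space by gradient descent on the contrastive loss.
    delta = np.zeros_like(x)
    for _ in range(steps):
        g = fd_grad(lambda d: attack_loss(W, x + d, t_harm, t_refuse), delta)
        delta -= lr * g
    return delta

def defense_step(W, x_adv, t_harm, t_refuse, steps=150, lr=0.1):
    # Model-update step: ascend the same loss w.r.t. the parameters, so the
    # perturbed input maps back toward refusal (a min-max game).
    W = W.copy()
    for _ in range(steps):
        g = fd_grad(lambda M: attack_loss(M, x_adv, t_harm, t_refuse), W)
        W += lr * g
    return W

# One round of the alternation on random toy data.
x = rng.normal(size=DIM)                     # benign input embedding
W = np.eye(DIM) + 0.1 * rng.normal(size=(DIM, DIM))
t_harm = rng.normal(size=DIM)
t_refuse = rng.normal(size=DIM)

loss_clean = attack_loss(W, x, t_harm, t_refuse)
delta = coe_attack_step(W, x, t_harm, t_refuse)
loss_adv = attack_loss(W, x + delta, t_harm, t_refuse)      # attack drives this down
W_def = defense_step(W, x + delta, t_harm, t_refuse)
loss_def = attack_loss(W_def, x + delta, t_harm, t_refuse)  # defense pushes it back up
```

In this toy, `loss_adv < loss_clean` after the attack step and `loss_def > loss_adv` after the defense step, mirroring the alternation the abstract describes; the real SafeMLLM additionally regularizes the model update to preserve utility on benign inputs, which this sketch omits.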