JaiLIP: Jailbreaking Vision-Language Models via Loss Guided Image Perturbation

📅 2025-09-24

📈 Citations: 0

✨ Influential: 0

career value

232K/year

🤖 AI Summary

To address the limitations of existing image-space jailbreaking attacks against vision-language models (VLMs)—namely, high perceptibility of perturbations, low success rates, and poor robustness—this paper proposes JaiLIP, the first end-to-end framework that jointly optimizes image distortion (MSE) and harmful-output loss to generate adversarial images that are visually imperceptible yet highly toxic. Leveraging gradient-guided perturbation and quantitative toxicity evaluation via Perspective API and Detoxify, JaiLIP achieves substantial improvements across major VLMs: average attack success rate increases by 32.7%, perceptual distortion drops below 0.05 (LPIPS), and toxicity scores rise by 41.2% over prior methods. The approach is validated in real-world traffic scenarios, demonstrating practical deployability. Moreover, it uncovers critical security vulnerabilities in multimodal alignment, providing empirical foundations for developing robust defense mechanisms.

Technology Category

Application Category

📝 Abstract

Vision-Language Models (VLMs) have remarkable abilities in generating multimodal reasoning tasks. However, potential misuse or safety alignment concerns of VLMs have increased significantly due to different categories of attack vectors. Among various attack vectors, recent studies have demonstrated that image-based perturbations are particularly effective in generating harmful outputs. In the literature, many existing techniques have been proposed to jailbreak VLMs, leading to unstable performance and visible perturbations. In this study, we propose Jailbreaking with Loss-guided Image Perturbation (JaiLIP), a jailbreaking attack in the image space that minimizes a joint objective combining the mean squared error (MSE) loss between clean and adversarial image with the models harmful-output loss. We evaluate our proposed method on VLMs using standard toxicity metrics from Perspective API and Detoxify. Experimental results demonstrate that our method generates highly effective and imperceptible adversarial images, outperforming existing methods in producing toxicity. Moreover, we have evaluated our method in the transportation domain to demonstrate the attacks practicality beyond toxic text generation in specific domain. Our findings emphasize the practical challenges of image-based jailbreak attacks and the need for efficient defense mechanisms for VLMs.

Problem

Research questions and friction points this paper is trying to address.

Generating imperceptible adversarial images to jailbreak VLMs

Minimizing joint loss for effective harmful output generation

Evaluating attack practicality beyond toxic text in specific domains

Innovation

Methods, ideas, or system contributions that make the work stand out.

Loss-guided image perturbation for jailbreaking

Minimizes joint MSE and harmful-output loss

Generates imperceptible adversarial images effectively

🔎 Similar Papers

ImgTrojan: Jailbreaking Vision-Language Models with ONE Image