Crafting Adversarial Inputs for Large Vision-Language Models Using Black-Box Optimization

📅 2026-01-05
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the vulnerability of large vision-language models (LVLMs) to adversarial jailbreak attacks in black-box settings, where existing white-box methods are impractical due to their reliance on full model access, high computational cost, and poor transferability. The authors propose ZO-SPSA, a black-box jailbreak attack based on zeroth-order optimization that requires only input-output interactions with the target model, eliminating the need for gradient information or surrogate models. By integrating simultaneous perturbation stochastic approximation (SPSA) with image perturbations and prompt injection, ZO-SPSA achieves an 83.0% jailbreak success rate on InstructBLIP. Adversarial examples generated using MiniGPT-4 exhibit a transferable success rate of 64.18% across other LVLMs while remaining visually imperceptible. This study presents the first efficient, low-resource, and highly transferable black-box jailbreak attack against LVLMs.

📝 Abstract
Recent advancements in Large Vision-Language Models (LVLMs) have shown groundbreaking capabilities across diverse multimodal tasks. However, these models remain vulnerable to adversarial jailbreak attacks, where adversaries craft subtle perturbations to bypass safety mechanisms and trigger harmful outputs. Existing white-box attack methods require full model access, incur high computational costs, and exhibit insufficient adversarial transferability, making them impractical for real-world, black-box settings. To address these limitations, we propose a black-box jailbreak attack on LVLMs via Zeroth-Order optimization using Simultaneous Perturbation Stochastic Approximation (ZO-SPSA). ZO-SPSA provides three key advantages: (i) gradient-free approximation via input-output interactions without requiring model knowledge, (ii) model-agnostic optimization without a surrogate model, and (iii) lower resource requirements with reduced GPU memory consumption. We evaluate ZO-SPSA on three LVLMs, including InstructBLIP, LLaVA and MiniGPT-4, achieving the highest jailbreak success rate of 83.0% on InstructBLIP, while maintaining imperceptible perturbations comparable to white-box methods. Moreover, adversarial examples generated from MiniGPT-4 exhibit strong transferability to other LVLMs, with ASR reaching 64.18%. These findings underscore the real-world feasibility of black-box jailbreaks and expose critical weaknesses in the safety mechanisms of current LVLMs.
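To illustrate the core idea behind the gradient-free optimization described above, here is a minimal sketch of a generic SPSA-based attack loop. This is not the paper's implementation: the loss function, step sizes, and L-infinity budget are illustrative assumptions, and a real attack would query the target LVLM inside `loss_fn`. SPSA's key property is that it perturbs all coordinates at once with a random ±1 direction, so each gradient estimate costs only two model queries regardless of the input's dimensionality.

```python
import numpy as np

def spsa_gradient_estimate(loss_fn, x, c=0.01, seed=None):
    """Two-query SPSA gradient estimate of loss_fn at x.

    All coordinates are perturbed simultaneously with a random
    Rademacher (+/-1) direction, so only two loss evaluations
    are needed no matter how many pixels x has.
    """
    rng = np.random.default_rng(seed)
    delta = rng.choice([-1.0, 1.0], size=x.shape)  # random +/-1 direction
    loss_plus = loss_fn(x + c * delta)
    loss_minus = loss_fn(x - c * delta)
    # Central finite difference along the random direction,
    # scaled back per-coordinate by 1/delta (= delta, since delta is +/-1)
    return (loss_plus - loss_minus) / (2.0 * c) * (1.0 / delta)

def spsa_attack(loss_fn, x0, steps=100, lr=0.005, eps=0.1, c=0.01, seed=0):
    """Iteratively perturb x0 to reduce loss_fn under an L-inf budget eps.

    Illustrative loop only: steps, lr, eps, and c are hypothetical
    hyperparameters, not values from the paper.
    """
    x = x0.copy()
    for t in range(steps):
        g = spsa_gradient_estimate(loss_fn, x, c=c, seed=seed + t)
        x = x - lr * np.sign(g)                 # signed descent step
        x = np.clip(x, x0 - eps, x0 + eps)      # project into eps-ball
        x = np.clip(x, 0.0, 1.0)                # keep valid pixel range
    return x
```

In a jailbreak setting, `loss_fn` would encode how far the model's response is from the attacker's target output, and the projection steps keep the adversarial image visually close to the original, matching the imperceptibility claim in the abstract.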
Problem

Research questions and friction points this paper is trying to address.

adversarial jailbreak
Large Vision-Language Models
black-box attack
safety mechanisms
adversarial transferability
Innovation

Methods, ideas, or system contributions that make the work stand out.

black-box attack
zeroth-order optimization
adversarial jailbreak
vision-language models
ZO-SPSA