🤖 AI Summary
Multimodal vision-language models (VLMs) face emerging jailbreaking threats, yet existing red-teaming methods rely on rigid templates, cover a narrow range of scenarios, and explore vulnerabilities only superficially. This paper proposes VERA-V, the first multimodal jailbreaking framework grounded in variational inference, which formulates adversarial attack discovery as learning a joint posterior distribution over paired text-image prompts. A lightweight attacker approximates this posterior to generate stealthy, coordinated, and diverse multimodal adversarial examples. By integrating typography-based text prompts that embed harmful cues, diffusion-based image synthesis, and structured distractors that fragment VLM attention, VERA-V improves attack robustness and generalizability. Evaluated on the HarmBench and HADES benchmarks, VERA-V significantly outperforms state-of-the-art methods, achieving up to a 53.75% higher attack success rate than the best baseline on GPT-4o.
📝 Abstract
Vision-Language Models (VLMs) extend large language models with visual reasoning, but their multimodal design also introduces new, underexplored vulnerabilities. Existing multimodal red-teaming methods largely rely on brittle templates, focus on single-attack settings, and expose only a narrow subset of vulnerabilities. To address these limitations, we introduce VERA-V, a variational inference framework that recasts multimodal jailbreak discovery as learning a joint posterior distribution over paired text-image prompts. This probabilistic view enables the generation of stealthy, coupled adversarial inputs that bypass model guardrails. We train a lightweight attacker to approximate the posterior, allowing efficient sampling of diverse jailbreaks and providing distributional insights into vulnerabilities. VERA-V further integrates three complementary strategies: (i) typography-based text prompts that embed harmful cues, (ii) diffusion-based image synthesis that introduces adversarial signals, and (iii) structured distractors to fragment VLM attention. Experiments on HarmBench and HADES benchmarks show that VERA-V consistently outperforms state-of-the-art baselines on both open-source and frontier VLMs, achieving up to 53.75% higher attack success rate (ASR) over the best baseline on GPT-4o.
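The variational formulation described above can be sketched generically as follows. This is a minimal illustration under standard variational-inference conventions, not the paper's exact objective: the attacker distribution \(q_\phi\), the jailbreak-success likelihood, and the prior term are all assumed notation.

```latex
% The attacker q_\phi approximates the posterior over paired text-image
% prompts (x_t, x_v) that jailbreak the target VLM for a harmful goal g:
%   p(x_t, x_v \mid \text{jailbreak}, g)
%     \propto p(\text{jailbreak} \mid x_t, x_v, g)\; p(x_t, x_v).
% An ELBO-style training objective for the lightweight attacker (assumed form):
\max_{\phi}\;
  \mathbb{E}_{(x_t, x_v) \sim q_\phi(\,\cdot \mid g)}
    \bigl[ \log p(\text{jailbreak} \mid x_t, x_v, g) \bigr]
  \;-\;
  \mathrm{KL}\!\bigl( q_\phi(x_t, x_v \mid g) \,\big\|\, p(x_t, x_v) \bigr)
```

Under this view, sampling repeatedly from the trained \(q_\phi\) yields diverse, coupled text-image jailbreaks rather than a single adversarial example, which is the distributional insight the abstract refers to.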