🤖 AI Summary
Multimodal vision-language models (VLMs) face emerging jailbreaking threats, yet existing red-teaming methods rely on rigid templates, cover a narrow range of scenarios, and explore vulnerabilities only superficially. This paper proposes VERA-V, the first multimodal jailbreaking framework grounded in variational inference, which formulates adversarial attack discovery as learning a joint posterior distribution over paired text-image prompts. A lightweight attacker approximates this posterior to generate stealthy, coordinated, and diverse multimodal adversarial examples. By integrating typography-based text prompts that embed harmful cues, diffusion-based image synthesis, and structured distractors that fragment VLM attention, VERA-V improves attack robustness and generalizability. Evaluated on the HarmBench and HADES benchmarks, VERA-V significantly outperforms state-of-the-art methods, achieving up to a 53.75% higher attack success rate than the best baseline on GPT-4o.
📝 Abstract
Vision-Language Models (VLMs) extend large language models with visual reasoning, but their multimodal design also introduces new, underexplored vulnerabilities. Existing multimodal red-teaming methods largely rely on brittle templates, focus on single-attack settings, and expose only a narrow subset of vulnerabilities. To address these limitations, we introduce VERA-V, a variational inference framework that recasts multimodal jailbreak discovery as learning a joint posterior distribution over paired text-image prompts. This probabilistic view enables the generation of stealthy, coupled adversarial inputs that bypass model guardrails. We train a lightweight attacker to approximate the posterior, allowing efficient sampling of diverse jailbreaks and providing distributional insights into vulnerabilities. VERA-V further integrates three complementary strategies: (i) typography-based text prompts that embed harmful cues, (ii) diffusion-based image synthesis that introduces adversarial signals, and (iii) structured distractors to fragment VLM attention. Experiments on HarmBench and HADES benchmarks show that VERA-V consistently outperforms state-of-the-art baselines on both open-source and frontier VLMs, achieving up to 53.75% higher attack success rate (ASR) over the best baseline on GPT-4o.
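The variational formulation described above can be sketched generically as follows. This is a minimal illustration under standard variational-inference conventions, not the paper's exact objective: the attacker distribution \(q_\phi\), the jailbreak-success likelihood, and the prior term are all assumed notation.

```latex
% The attacker q_\phi approximates the posterior over paired text-image
% prompts (x_t, x_v) that jailbreak the target VLM for a harmful goal g:
%   p(x_t, x_v \mid \text{jailbreak}, g)
%     \propto p(\text{jailbreak} \mid x_t, x_v, g)\; p(x_t, x_v).
% An ELBO-style training objective for the lightweight attacker (assumed form):
\max_{\phi}\;
  \mathbb{E}_{(x_t, x_v) \sim q_\phi(\,\cdot \mid g)}
    \bigl[ \log p(\text{jailbreak} \mid x_t, x_v, g) \bigr]
  \;-\;
  \mathrm{KL}\!\bigl( q_\phi(x_t, x_v \mid g) \,\big\|\, p(x_t, x_v) \bigr)
```

Under this view, sampling repeatedly from the trained \(q_\phi\) yields diverse, coupled text-image jailbreaks rather than a single adversarial example, which is the distributional insight the abstract refers to.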