BackdoorVLM: A Benchmark for Backdoor Attacks on Vision-Language Models

📅 2025-11-24

📈 Citations: 0

✨ Influential: 0

career value

195K/year

🤖 AI Summary

Systematic evaluation of backdoor attacks against vision-language models (VLMs) remains lacking. Method: We introduce BackdoorVLM—the first dedicated benchmark for VLM backdoor assessment—covering tasks including image captioning and visual question answering, and five representative attack paradigms. We propose a unified multimodal backdoor taxonomy and evaluation framework integrating 12 attack methods, employing textual, visual, and cross-modal triggers. Experiments are conducted on two open-source VLMs and three mainstream multimodal datasets. Contribution/Results: Our findings reveal that the textual modality dominates cross-modal triggers; even a 1% poisoning rate achieves >90% attack success rate, underscoring VLMs’ extreme sensitivity to textual instructions and critical security vulnerabilities. This work fills a fundamental gap in multimodal model security evaluation and provides a reproducible, extensible benchmark to advance backdoor defense research.

Technology Category

Application Category

📝 Abstract

Backdoor attacks undermine the reliability and trustworthiness of machine learning systems by injecting hidden behaviors that can be maliciously activated at inference time. While such threats have been extensively studied in unimodal settings, their impact on multimodal foundation models, particularly vision-language models (VLMs), remains largely underexplored. In this work, we introduce extbf{BackdoorVLM}, the first comprehensive benchmark for systematically evaluating backdoor attacks on VLMs across a broad range of settings. It adopts a unified perspective that injects and analyzes backdoors across core vision-language tasks, including image captioning and visual question answering. BackdoorVLM organizes multimodal backdoor threats into 5 representative categories: targeted refusal, malicious injection, jailbreak, concept substitution, and perceptual hijack. Each category captures a distinct pathway through which an adversary can manipulate a model's behavior. We evaluate these threats using 12 representative attack methods spanning text, image, and bimodal triggers, tested on 2 open-source VLMs and 3 multimodal datasets. Our analysis reveals that VLMs exhibit strong sensitivity to textual instructions, and in bimodal backdoors the text trigger typically overwhelms the image trigger when forming the backdoor mapping. Notably, backdoors involving the textual modality remain highly potent, with poisoning rates as low as 1% yielding over 90% success across most tasks. These findings highlight significant, previously underexplored vulnerabilities in current VLMs. We hope that BackdoorVLM can serve as a useful benchmark for analyzing and mitigating multimodal backdoor threats. Code is available at: https://github.com/bin015/BackdoorVLM .

Problem

Research questions and friction points this paper is trying to address.

Evaluating backdoor attack vulnerabilities in vision-language models

Analyzing multimodal threats across image captioning and question answering

Benchmarking attack methods using text, image, and bimodal triggers

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces first benchmark for multimodal backdoor attacks

Organizes threats into five distinct manipulation categories

Reveals high vulnerability to textual triggers in VLMs

🔎 Similar Papers

Backdooring Vision-Language Models with Out-Of-Distribution Data