Backdoor Attack on Vision Language Models with Stealthy Semantic Manipulation

📅 2025-06-08

📈 Citations: 0

✨ Influential: 0

career value

217K/year

🤖 AI Summary

This work identifies a novel semantic-level backdoor attack surface in vision-language models (VLMs), exploiting cross-modal image-text semantic mismatch as an implicit trigger—distinct from conventional pixel-level perturbations. Method: The attack injects poisoned data during training by deliberately misaligning image-text pairs, enabling invisible, pixel-unmodified semantic manipulation without altering input pixels. It systematically exploits semantic inconsistency vulnerabilities inherent in VLM cross-modal fusion, supported by the SIMBad benchmark dataset, attribute-level semantic control, and attention visualization analysis. Contribution/Results: The proposed backdoor achieves >98% average attack success rate across four mainstream VLMs. It exhibits strong cross-distribution generalization and cross-modal transferability, and remains fully robust against prevalent defenses—including prompt engineering and supervised fine-tuning—demonstrating unprecedented stealth and resilience.

Technology Category

Application Category

📝 Abstract

Vision Language Models (VLMs) have shown remarkable performance, but are also vulnerable to backdoor attacks whereby the adversary can manipulate the model's outputs through hidden triggers. Prior attacks primarily rely on single-modality triggers, leaving the crucial cross-modal fusion nature of VLMs largely unexplored. Unlike prior work, we identify a novel attack surface that leverages cross-modal semantic mismatches as implicit triggers. Based on this insight, we propose BadSem (Backdoor Attack with Semantic Manipulation), a data poisoning attack that injects stealthy backdoors by deliberately misaligning image-text pairs during training. To perform the attack, we construct SIMBad, a dataset tailored for semantic manipulation involving color and object attributes. Extensive experiments across four widely used VLMs show that BadSem achieves over 98% average ASR, generalizes well to out-of-distribution datasets, and can transfer across poisoning modalities. Our detailed analysis using attention visualization shows that backdoored models focus on semantically sensitive regions under mismatched conditions while maintaining normal behavior on clean inputs. To mitigate the attack, we try two defense strategies based on system prompt and supervised fine-tuning but find that both of them fail to mitigate the semantic backdoor. Our findings highlight the urgent need to address semantic vulnerabilities in VLMs for their safer deployment.

Problem

Research questions and friction points this paper is trying to address.

VLMs vulnerable to cross-modal semantic backdoor attacks

Existing attacks overlook cross-modal fusion vulnerabilities

Defense strategies fail against semantic manipulation attacks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages cross-modal semantic mismatches as triggers

Proposes BadSem for stealthy data poisoning

Uses SIMBad dataset for semantic manipulation

🔎 Similar Papers

Backdooring Vision-Language Models with Out-Of-Distribution Data