CBV: Clean-label Backdoor Attacks on Vision Language Models via Diffusion Models

📅 2026-05-04
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing backdoor attacks on vision-language models are often easy to detect because poisoning introduces image-text mismatches. This work proposes CBV, a clean-label backdoor attack based on diffusion models: it modifies the score during the reverse generation process so that generated samples carry the triggered image features, and it uses the text of the triggered images as multimodal guidance. To improve stealthiness, a GradCAM-guided mask restricts modifications to the most semantically important regions rather than the whole image. Evaluated on MSCOCO and VQA v2 with four representative VLMs, the attack achieves over 80% attack success rate (ASR) while preserving normal task performance.
📝 Abstract
Vision-Language Models (VLMs) have achieved remarkable success in tasks such as image captioning and visual question answering (VQA). However, as their applications become increasingly widespread, recent studies have revealed that VLMs are vulnerable to backdoor attacks. Existing backdoor attacks on VLMs primarily rely on data poisoning by adding visual triggers and modifying text labels, where the induced image-text mismatch makes poisoned samples easy to detect. To address this limitation, we propose the Clean-Label Backdoor Attack on VLMs via Diffusion Models (CBV), which leverages diffusion models to generate natural poisoned examples via score matching. Specifically, CBV modifies the score during the reverse generation process of the diffusion model to guide the generation of poisoned samples that contain triggered image features. To further enhance the effectiveness of the attack, we incorporate the textual information of the triggered images as multimodal guidance during generation. Moreover, to enhance stealthiness, we introduce a GradCAM-guided Mask (GM) that restricts modifications to only the most semantically important regions, rather than the entire image. We evaluate our method on MSCOCO and VQA v2 with four representative VLMs, achieving over 80% ASR while preserving normal functionality.
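The two mechanisms named in the abstract, a modified score in the reverse process and a mask that limits where changes land, can be illustrated with a toy sketch. This is not the CBV implementation: the function names are hypothetical, a plain saliency list stands in for a GradCAM map, a quadratic pull toward `target_features` stands in for the paper's feature and text guidance, and the score of a standard normal distribution stands in for a trained diffusion model.

```python
import random


def top_fraction_mask(saliency, fraction):
    """Return a 0/1 mask selecting the `fraction` most salient positions.

    Assumption: `saliency` is a flat list standing in for a GradCAM map.
    """
    k = max(1, round(fraction * len(saliency)))
    # Indices of the k largest saliency values.
    top = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)[:k]
    mask = [0.0] * len(saliency)
    for i in top:
        mask[i] = 1.0
    return mask


def prior_score(x):
    """Toy base score: the score of a standard normal is -x.

    Assumption: a real pipeline would query a trained diffusion model here.
    """
    return [-xi for xi in x]


def guided_score(x, base_score, target_features, mask, scale):
    """Add a masked guidance term that pulls x toward target_features.

    The gradient of -0.5 * ||x - target||^2 with respect to x is (target - x);
    the mask zeroes the guidance outside the selected positions.
    """
    return [
        s + scale * m * (t - xi)
        for xi, s, t, m in zip(x, base_score, target_features, mask)
    ]


def reverse_step(x, score, step_size, noise_scale, rng):
    """One Langevin-style update: x + step_size * score + noise."""
    return [
        xi + step_size * si + noise_scale * rng.gauss(0.0, 1.0)
        for xi, si in zip(x, score)
    ]
```

In this sketch, positions outside the mask receive only the base score, so the guidance term changes the sample solely where the saliency is highest. How CBV actually computes the guidance and the mask is specified in the paper, not here.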
Problem

Research questions and friction points this paper is trying to address.

Clean-label Backdoor Attack
Vision-Language Models
Diffusion Models
Data Poisoning
Stealthiness
Innovation

Methods, ideas, or system contributions that make the work stand out.

clean-label backdoor attack
vision-language models
diffusion models
multimodal guidance
GradCAM-guided mask
Ji Guo
Laboratory of Intelligent Collaborative Computing, University of Electronic Science and Technology of China, China
Xiaolong Qin
School of Software Engineering, Chengdu University of Information Technology, China
Cencen Liu
Laboratory of Intelligent Collaborative Computing, University of Electronic Science and Technology of China, China
Jielei Wang
Laboratory of Intelligent Collaborative Computing, University of Electronic Science and Technology of China, China
Jierun Chen
HKUST
Multi-modal Models, Large Language Models, Efficient AI
Wenbo Jiang
University of Electronic Science and Technology of China
AI security, Backdoor attack