Neural Antidote: Class-Wise Prompt Tuning for Purifying Backdoors in Pre-trained Vision-Language Models

📅 2025-02-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Pre-trained vision-language models such as CLIP are vulnerable to backdoor attacks, and existing fine-tuning–based defenses suffer from degraded clean accuracy and insufficient robustness in data-scarce settings. To address this, the paper proposes Class-wise Backdoor Prompt Tuning (CBPT), a defense that employs learnable, class-wise textual prompts for backdoor mitigation. CBPT first reconstructs a dummy trigger via contrastive reverse engineering, then refines the prompts to realign the model's decision boundary without updating the backbone's parameters, balancing efficiency and generalizability. Evaluated across seven prevalent backdoor attacks, CBPT achieves an average clean accuracy of 58.86% while reducing the attack success rate to only 0.39%, substantially outperforming full-model fine-tuning baselines.

📝 Abstract
While pre-trained Vision-Language Models (VLMs) such as CLIP exhibit excellent representational capabilities for multimodal data, recent studies have shown that they are vulnerable to backdoor attacks. To alleviate the threat, existing defense strategies primarily focus on fine-tuning the entire suspicious model, yet they offer only marginal resistance to state-of-the-art attacks and often cause a drop in clean accuracy, particularly in data-limited scenarios. This failure may be attributed to the mismatch between the scarce fine-tuning data and the massive parameter count of VLMs. To address this challenge, we propose Class-wise Backdoor Prompt Tuning (CBPT), an efficient and effective defense that operates on the text prompts to indirectly purify poisoned VLMs. Specifically, we first employ contrastive learning with carefully crafted positive and negative samples to invert the backdoor trigger potentially adopted by the attacker. Once this dummy trigger is established, we use efficient prompt tuning to optimize the class-wise text prompts, modifying the model's decision boundary so that the feature regions occupied by backdoor triggers are reclassified. Extensive experiments demonstrate that CBPT significantly mitigates backdoor threats while preserving model utility, e.g., an average Clean Accuracy (CA) of 58.86% and an Attack Success Rate (ASR) of 0.39% across seven mainstream backdoor attacks. These results underscore the superiority of our prompt-purifying design in strengthening model robustness against backdoor attacks.
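The trigger-inversion step can be illustrated with a minimal sketch. Everything below is a hypothetical stand-in, not the paper's implementation: a random linear map plays the role of CLIP's frozen image encoder, the backdoor-target feature direction is assumed known, and the contrastive objective is reduced to a single positive/negative similarity pair.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in for a frozen image encoder (the paper uses CLIP; a random
# linear map keeps this sketch self-contained).
encoder = torch.nn.Linear(3 * 32 * 32, 64)
for p in encoder.parameters():
    p.requires_grad_(False)

def embed(x):
    """Flatten images and return L2-normalized features."""
    return F.normalize(encoder(x.flatten(1)), dim=-1)

clean = torch.rand(16, 3, 32, 32)                    # surrogate clean images
target_feat = F.normalize(torch.randn(64), dim=-1)   # assumed target direction

# Learnable dummy trigger: a bounded additive perturbation.
delta = torch.zeros(1, 3, 32, 32, requires_grad=True)
opt = torch.optim.Adam([delta], lr=0.1)

for _ in range(100):
    opt.zero_grad()
    poisoned = (clean + delta.clamp(-0.1, 0.1)).clamp(0, 1)
    f_poi, f_cln = embed(poisoned), embed(clean)
    # Contrastive objective: pull poisoned features toward the target
    # direction (positive pair) and away from clean features (negative pair).
    pos = (f_poi * target_feat).sum(-1).mean()
    neg = (f_poi * f_cln).sum(-1).mean()
    loss = neg - pos
    loss.backward()
    opt.step()
```

The optimized `delta` plays the role of the inverted dummy trigger: its features drift toward the assumed backdoor-target direction, mirroring the positive/negative sample construction described in the abstract.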
Problem

Research questions and friction points this paper is trying to address.

Pre-trained VLMs such as CLIP are vulnerable to backdoor attacks.
Full-model fine-tuning defenses offer weak resistance and degrade clean accuracy, especially with limited data.
How can backdoor threats be mitigated while preserving model utility?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Class-wise Backdoor Prompt Tuning (CBPT)
Contrastive learning for trigger inversion
Class-wise text-prompt optimization that reshapes the decision boundary
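The prompt-tuning side can be sketched in a CoOp-style form. The tensors below are synthetic stand-ins (a random linear map for the frozen text encoder, random class-token embeddings, surrogate image features clustered around per-class prototypes), so this only illustrates the mechanics of optimizing one learnable context per class while the backbone stays frozen; it is not the paper's actual objective.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_cls, ctx_dim, feat_dim = 5, 32, 64

# Frozen stand-in for CLIP's text encoder (an assumption of this sketch).
text_proj = torch.nn.Linear(2 * ctx_dim, feat_dim)
for p in text_proj.parameters():
    p.requires_grad_(False)
class_tokens = torch.randn(n_cls, ctx_dim)   # fixed "class name" embeddings

# One learnable context vector per class: the class-wise prompts.
ctx = torch.zeros(n_cls, ctx_dim, requires_grad=True)
opt = torch.optim.Adam([ctx], lr=0.05)

# Surrogate clean image features clustered around per-class prototypes.
protos = F.normalize(torch.randn(n_cls, feat_dim), dim=-1)
labels = torch.randint(0, n_cls, (100,))
feats = F.normalize(protos[labels] + 0.1 * torch.randn(100, feat_dim), dim=-1)

losses = []
for _ in range(200):
    opt.zero_grad()
    # Prepend each class's learnable context to its class token,
    # encode, and classify by cosine similarity to the image features.
    text_feats = F.normalize(
        text_proj(torch.cat([ctx, class_tokens], dim=-1)), dim=-1)
    logits = 100.0 * feats @ text_feats.t()
    loss = F.cross_entropy(logits, labels)
    losses.append(loss.item())
    loss.backward()
    opt.step()
```

Only `ctx` receives gradients, so the decision boundary moves through the text branch alone; in CBPT this same mechanism is driven by objectives that push the inverted trigger's feature region away from the attacker's target class.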
Jiawei Kong (Tsinghua University; trustworthy AI)
Hao Fang (Harbin Institute of Technology, Shenzhen)
Sihang Guo (Harbin Institute of Technology, Shenzhen)
Chenxi Qing (Harbin Institute of Technology, Shenzhen)
Bin Chen (Tsinghua Shenzhen International Graduate School, Tsinghua University)
Bin Wang (Tsinghua Shenzhen International Graduate School, Tsinghua University)
Shu-Tao Xia (SIGS, Tsinghua University; coding and information theory, machine learning, computer vision, AI security)