🤖 AI Summary
Pretrained vision-language models such as CLIP are vulnerable to backdoor attacks, and existing fine-tuning–based defenses suffer from degraded clean accuracy and insufficient robustness in data-scarce settings. To address this, the authors propose Class-wise Backdoor Prompt Tuning (CBPT), a defense that purifies poisoned VLMs indirectly through learnable class-wise text prompts rather than by updating the model's parameters. CBPT first inverts a dummy version of the attacker's trigger via contrastive learning over carefully crafted positive and negative samples, then optimizes the class-wise prompts so that the decision boundary reclassifies the feature regions occupied by the trigger. Evaluated across seven prevalent backdoor attacks, CBPT preserves an average clean accuracy of 58.86% while reducing the attack success rate to only 0.39%, substantially outperforming full-model fine-tuning baselines.
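The trigger-inversion step in the summary can be pictured with a short sketch. The following is a minimal, hypothetical PyTorch illustration, not the authors' released code: it assumes a frozen CLIP image encoder (all parameters with `requires_grad=False`), inputs in `[0, 1]`, and a square patch trigger stamped in a fixed corner, and it constructs positives/negatives in one plausible way (stamped versions of different images should agree, while each stamped image should drift from its own clean embedding). The paper's exact sample construction may differ.

```python
import torch
import torch.nn.functional as F

def invert_trigger(image_encoder, clean_images, steps=200, lr=0.1, patch=16):
    """Optimize a dummy trigger so that stamped images collapse toward a
    shared 'backdoor-like' representation while moving away from their
    clean counterparts. Hedged sketch: encoder is assumed frozen."""
    device = clean_images.device
    trigger = torch.rand(1, 3, patch, patch, device=device, requires_grad=True)
    opt = torch.optim.Adam([trigger], lr=lr)

    # Clean embeddings are fixed targets, so compute them once without grad.
    with torch.no_grad():
        z_clean = F.normalize(image_encoder(clean_images), dim=-1)

    for _ in range(steps):
        stamped = clean_images.clone()
        stamped[:, :, :patch, :patch] = trigger     # stamp top-left corner
        z = F.normalize(image_encoder(stamped), dim=-1)

        b = z.shape[0]
        pairwise = z @ z.t()                        # (B, B) cosine similarities
        off_diag = pairwise[~torch.eye(b, dtype=torch.bool, device=device)]

        # Positive pairs: stamped versions of *different* inputs should align
        # (a genuine trigger hijacks the representation); negative pairs:
        # each stamped image vs. its own clean embedding.
        loss = (z * z_clean).sum(-1).mean() - off_diag.mean()

        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            trigger.clamp_(0, 1)                    # keep a valid pixel range

    return trigger.detach()
```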
📝 Abstract
While pre-trained Vision-Language Models (VLMs) such as CLIP exhibit excellent representational capabilities for multimodal data, recent studies have shown that they are vulnerable to backdoor attacks. To alleviate this threat, existing defense strategies primarily fine-tune the entire suspicious model, yet they offer only marginal resistance to state-of-the-art attacks and often degrade clean accuracy, particularly in data-limited scenarios. Their failure may be attributed to the mismatch between the limited fine-tuning data and the massive number of parameters in VLMs. To address this challenge, we propose the Class-wise Backdoor Prompt Tuning (CBPT) defense, an efficient and effective method that operates on text prompts to indirectly purify poisoned VLMs. Specifically, we first employ contrastive learning with carefully crafted positive and negative samples to invert the backdoor trigger that the attacker may have adopted. Once this dummy trigger is established, we apply efficient prompt tuning to optimize class-wise text prompts, modifying the model's decision boundary so that the feature regions exploited by backdoor triggers are reclassified. Extensive experiments demonstrate that CBPT significantly mitigates backdoor threats while preserving model utility, e.g., an average Clean Accuracy (CA) of 58.86% and an Attack Success Rate (ASR) of 0.39% across seven mainstream backdoor attacks. These results underscore the superiority of our prompt-purifying design in strengthening model robustness against backdoor attacks.
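To make the prompt-tuning stage concrete, here is a minimal sketch of class-wise prompt tuning in a CoOp-style setup. Everything below is an assumption for illustration: the module name `ClassWisePrompts`, the `encode_text_from_embeddings` hook, and the joint clean/stamped cross-entropy objective are hypothetical stand-ins rather than the paper's exact formulation; the CLIP backbone is frozen and only the per-class context vectors receive gradients.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassWisePrompts(nn.Module):
    """One learnable context per class (rather than a single shared prompt),
    prepended to the token embeddings of each class name."""
    def __init__(self, num_classes: int, n_ctx: int = 16, dim: int = 512):
        super().__init__()
        # (num_classes, n_ctx, dim): independent "class-wise" context vectors.
        self.ctx = nn.Parameter(0.02 * torch.randn(num_classes, n_ctx, dim))

    def forward(self, name_embeds: torch.Tensor) -> torch.Tensor:
        # name_embeds: (num_classes, n_name_tokens, dim) frozen token
        # embeddings of the class names; prepend the learnable context.
        return torch.cat([self.ctx, name_embeds], dim=1)

def purification_step(clip_model, prompts, name_embeds, images, labels,
                      trigger, optimizer, temperature=0.01):
    """One tuning step: keep clean predictions correct while forcing
    trigger-stamped images back to their true labels, nudging the decision
    boundary away from the trigger's feature region. Assumes every CLIP
    parameter has requires_grad=False, so only prompts.ctx is updated."""
    p = trigger.shape[-1]
    stamped = images.clone()
    stamped[:, :, :p, :p] = trigger               # dummy trigger from stage 1

    # Text features from the tuned prompts; encode_text_from_embeddings is
    # a hypothetical hook running CLIP's text encoder on raw embeddings.
    txt = F.normalize(
        clip_model.encode_text_from_embeddings(prompts(name_embeds)), dim=-1)

    loss = 0.0
    for x in (images, stamped):
        img = F.normalize(clip_model.encode_image(x), dim=-1)   # (B, dim)
        logits = img @ txt.t() / temperature                    # (B, classes)
        loss = loss + F.cross_entropy(logits, labels)

    optimizer.zero_grad()
    loss.backward()            # gradients reach prompts.ctx only (frozen CLIP)
    optimizer.step()
    return float(loss)
```

Tuning only `prompts.ctx` keeps the trainable parameter count tiny relative to the backbone, which is what makes this kind of purification viable in the data-limited scenarios the abstract highlights.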