🤖 AI Summary
Vision-language pre-trained models (e.g., CLIP) are vulnerable to backdoor attacks during third-party fine-tuning, yet existing detection methods rely on unrealistic priors, such as access to training data, trigger patterns, or target classes, which limits their practical applicability.
Method: We propose AMDET, a model-level backdoor detection framework that requires no prior knowledge. AMDET exploits two intrinsic properties of backdoored text encoders, "feature assimilation" and attention concentration, and combines gradient inversion (to recover implicit trigger-activating features), attention-weight analysis, and loss-landscape modeling for end-to-end detection.
Contribution/Results: Evaluated on 3,600 fine-tuned models, AMDET achieves an 89.90% F1 score, completes each detection in ≈5 minutes, and demonstrates strong robustness against adaptive attacks. To our knowledge, this is the first work to exploit text-encoder-specific behavioral signatures for model-agnostic, prior-free backdoor detection in vision-language models.
📝 Abstract
Vision-language pretrained models (VLPs) such as CLIP have achieved remarkable success, but they are also highly vulnerable to backdoor attacks. Given a model fine-tuned by an untrusted third party, determining whether the model has been injected with a backdoor is a critical and challenging problem. Existing detection methods usually rely on prior knowledge of the training dataset, backdoor triggers and targets, or downstream classifiers, which may be impractical in real-world applications. To address this challenge, we introduce Assimilation Matters in DETection (AMDET), a novel model-level detection framework that operates without any such prior knowledge. Specifically, we first reveal the feature assimilation property of backdoored text encoders: the representations of all tokens within a backdoor sample exhibit high similarity. Further analysis attributes this effect to the concentration of attention weights on the trigger token. Leveraging this insight, AMDET scans a model by performing gradient-based inversion on token embeddings to recover implicit features capable of activating backdoor behaviors. Furthermore, we identify natural backdoor features in OpenAI's official CLIP model, which are not intentionally injected but still exhibit backdoor-like behaviors; we filter them out from genuinely injected backdoors by analyzing their loss landscapes. Extensive experiments on 3,600 backdoored and benign fine-tuned models, covering two attack paradigms and three VLP model structures, show that AMDET detects backdoors with an F1 score of 89.90%. Moreover, it completes a full detection in approximately 5 minutes on an RTX 4090 GPU and exhibits strong robustness against adaptive attacks. Code is available at: https://github.com/Robin-WZQ/AMDET
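To make the "feature assimilation" signal concrete, the following is a minimal, hypothetical sketch (not the paper's implementation): it scores a sample by the mean pairwise cosine similarity of its token representations, which should be high when tokens collapse toward a trigger direction. The function name, toy dimensions, and synthetic features are assumptions for illustration only.

```python
import numpy as np

def assimilation_score(token_feats: np.ndarray) -> float:
    """Mean pairwise cosine similarity over token representations.

    token_feats: (num_tokens, dim) array of token features from a text
    encoder. A high score suggests feature assimilation, the backdoor
    signature described in the abstract. Illustrative metric only.
    """
    X = token_feats / np.linalg.norm(token_feats, axis=1, keepdims=True)
    sim = X @ X.T                      # pairwise cosine similarities
    n = X.shape[0]
    # Average over off-diagonal pairs (exclude each token's self-similarity).
    return float((sim.sum() - n) / (n * (n - 1)))

rng = np.random.default_rng(0)
# "Benign" sample: diverse, roughly independent token features.
benign = rng.normal(size=(8, 16))
# "Backdoored" sample: tokens collapse toward one (trigger-like) direction.
trigger_dir = rng.normal(size=16)
backdoored = trigger_dir + 0.05 * rng.normal(size=(8, 16))

print(assimilation_score(benign) < assimilation_score(backdoored))  # True
```

In practice the features would come from the text encoder's token outputs rather than synthetic Gaussians, and a detection threshold on this score would be calibrated on known-benign models.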