🤖 AI Summary
Multi-trigger backdoor attacks—where adversaries embed diverse, stealthy triggers across object classes to evade detection—pose escalating threats to deep neural networks (DNNs) and generative AI systems.
Method: We propose DBOM, a vision-language model-based active defense framework. DBOM freezes the pretrained multimodal encoders and jointly employs a learnable visual prompt repository with prompt prefix tuning to explicitly model and disentangle trigger and object features within a shared representation space. It introduces a trigger-object separation loss and a diversity constraint to enable zero-shot generalization to both seen and unseen trigger-object combinations.
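The separation and diversity objectives described above could plausibly be realized as cosine-similarity penalties on the learned prompt embeddings. The sketch below is illustrative only: the function names, the use of squared cosine similarity, and the exact formulation are assumptions, not taken from the paper.

```python
import numpy as np

def cosine_sim(a, b):
    """Pairwise cosine similarity between rows of a and b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def separation_loss(trigger_emb, object_emb):
    """Penalize alignment between trigger and object prompt embeddings,
    pushing the two sets of primitives apart in the shared space.
    (Illustrative form; the paper's exact loss may differ.)"""
    return np.mean(cosine_sim(trigger_emb, object_emb) ** 2)

def diversity_loss(trigger_emb):
    """Penalize mutual similarity among trigger prompts in the repository,
    encouraging each entry to capture a distinct trigger pattern."""
    sim = cosine_sim(trigger_emb, trigger_emb)
    off_diag = sim - np.eye(trigger_emb.shape[0])
    return np.mean(off_diag ** 2)
```

Both losses reach zero when trigger and object embeddings are mutually orthogonal, which matches the stated goal of disentangling the two kinds of features.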
Contribution/Results: Evaluated on CIFAR-10 and GTSRB, DBOM significantly improves poisoned sample detection accuracy while enhancing the security and robustness of DNN training pipelines against multi-trigger backdoor threats.
📝 Abstract
Deep neural networks (DNNs) and generative AI (GenAI) are increasingly vulnerable to backdoor attacks, where adversaries embed triggers into inputs to cause models to misclassify them into attacker-specified target labels. Beyond traditional single-trigger scenarios, attackers may inject multiple triggers across various object classes, forming unseen backdoor-object configurations that evade standard detection pipelines. In this paper, we introduce DBOM (Disentangled Backdoor-Object Modeling), a proactive framework that leverages structured disentanglement to identify and neutralize both seen and unseen backdoor threats at the dataset level. Specifically, DBOM factorizes input image representations by modeling triggers and objects as independent primitives in the embedding space through the use of Vision-Language Models (VLMs). By leveraging the frozen, pre-trained encoders of VLMs, our approach decomposes the latent representations into distinct components through a learnable visual prompt repository and prompt prefix tuning, ensuring that the relationships between triggers and objects are explicitly captured. To separate trigger and object representations in the visual prompt repository, we introduce trigger-object separation and diversity losses that aid in disentangling trigger and object visual features. Next, by aligning image features, via feature decomposition and fusion, with learned contextual prompt tokens in a shared multimodal space, DBOM enables zero-shot generalization to novel trigger-object pairings that were unseen during training, thereby offering deeper insights into adversarial attack patterns. Experimental results on CIFAR-10 and GTSRB demonstrate that DBOM robustly detects poisoned images prior to downstream training, significantly enhancing the security of DNN training pipelines.
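The zero-shot detection step described in the abstract, matching an image embedding against composite trigger-object prompts in a shared space, could be sketched as follows. This is a minimal illustration under assumed interfaces: `detect_poisoned`, the `(trigger, object)` label convention with `None` for clean prompts, and nearest-prompt scoring are all hypothetical simplifications, not the paper's actual pipeline.

```python
import numpy as np

def detect_poisoned(img_emb, pair_embs, pair_labels):
    """Flag an image as poisoned if its embedding is closest to a
    composite prompt that contains a trigger primitive.

    img_emb:     (d,) image embedding from a frozen VLM image encoder
    pair_embs:   (P, d) embeddings of composed trigger+object prompts
    pair_labels: list of (trigger_or_None, object_name) per prompt
    """
    sims = pair_embs @ img_emb / (
        np.linalg.norm(pair_embs, axis=1) * np.linalg.norm(img_emb))
    best = int(np.argmax(sims))          # nearest composite prompt
    trigger, obj = pair_labels[best]
    return trigger is not None, obj      # (is_poisoned, predicted object)
```

Because the prompt repository composes trigger and object primitives, pairings never seen together during training can still be scored, which is the mechanism behind the claimed zero-shot generalization.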