🤖 AI Summary
This work addresses backdoor attacks that use diverse embedding mechanisms in deep neural networks, proposing a universal backdoor detection and reverse-engineering method that requires neither training data nor prior knowledge of the attack. The core problem tackled is the lack of robust, model-agnostic approaches capable of identifying triggers and localizing target classes without access to clean data or attack specifications. The method introduces consensus-based embedding perturbation into backdoor feature space inversion, comprising three key components: (i) embedding-layer perturbation modeling, (ii) consensus-clustering-guided reverse optimization, and (iii) unsupervised target-class discrimination. This enables simultaneous trigger reconstruction and interpretable target-class identification. Evaluated on CIFAR-10 and CIFAR-100 against BadNets, Blend, SIG, and other state-of-the-art attacks, the approach achieves >96% detection accuracy, surpassing existing SOTA methods, and demonstrates strong robustness and cross-attack generalizability.
📝 Abstract
A variety of defenses have been proposed against Trojans (backdoor attacks) planted in deep neural network (DNN) classifiers. Backdoor-agnostic methods seek to reliably detect and/or mitigate backdoors irrespective of the incorporation mechanism used by the attacker, while inversion methods explicitly assume one. In this paper, we describe a new detector that: relies on embedded feature representations to estimate (invert) the backdoor and to identify its target class; can operate without access to the training dataset; and is highly effective for various incorporation mechanisms (i.e., is backdoor agnostic). Our detection approach is evaluated against an array of published defenses, for a variety of different attacks on the CIFAR-10 and CIFAR-100 image-classification domains, and is found to compare favorably.