🤖 AI Summary
Adversarial examples severely compromise the robustness of deep neural networks (DNNs), yet existing defenses either rely on attack-specific priors or require architectural modifications, and thus suffer from poor generalizability and computational overhead. This paper proposes a universal, lightweight adversarial example detection framework that requires no model fine-tuning or additional training. By statistically analyzing the distributions of layer-wise activations (including gradient-sensitivity modeling and anomaly detection), it enables plug-and-play detection across diverse architectures and modalities (image, video, and audio). Our key contribution is the first statistically grounded, assumption-free, and interpretable detection paradigm, derived purely from intrinsic activation statistics and therefore free of any dependence on prior knowledge of attack types. Extensive evaluation demonstrates >95% detection accuracy across multiple datasets and attack settings, with inference overhead under 0.5% of the original model's computational cost, significantly outperforming state-of-the-art methods.
📝 Abstract
Deep Neural Networks (DNNs) have been shown to be vulnerable to adversarial examples. While numerous successful adversarial attacks have been proposed, defenses against these attacks remain relatively understudied. Existing defense approaches either focus on negating the effects of perturbations caused by the attacks to restore the DNNs' original predictions or use a secondary model to detect adversarial examples. However, these methods often become ineffective due to the continuous advancements in attack techniques. We propose a novel universal and lightweight method to detect adversarial examples by analyzing the layer outputs of DNNs. Through theoretical justification and extensive experiments, we demonstrate that our detection method is highly effective, compatible with any DNN architecture, and applicable across different domains, such as image, video, and audio.
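To make the core idea of detection from layer outputs concrete, the sketch below shows one plausible instantiation: fit simple per-feature Gaussian statistics on a layer's activations over clean data, then flag inputs whose activations deviate strongly. This is an illustrative assumption, not the paper's actual algorithm; the synthetic arrays stand in for activations that would normally be captured with forward hooks on a real DNN.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for one layer's activations (hypothetical data):
# in practice these would be collected via forward hooks on a trained DNN.
clean_acts = rng.normal(size=(1000, 64))          # clean calibration set
test_clean = rng.normal(size=64)                  # in-distribution sample
test_adv = rng.normal(size=64) + 6.0              # sample with shifted activations

# Fit per-feature Gaussian statistics on clean activations only;
# no retraining or model modification is needed.
mu = clean_acts.mean(axis=0)
sigma = clean_acts.std(axis=0) + 1e-8             # avoid division by zero

def anomaly_score(act):
    """Mean absolute z-score of a layer's activations under clean statistics."""
    return float(np.abs((act - mu) / sigma).mean())

# Threshold would be calibrated on held-out clean data in practice.
threshold = 2.0

print(anomaly_score(test_clean) > threshold)      # clean input: not flagged
print(anomaly_score(test_adv) > threshold)        # shifted input: flagged
```

The same scoring could be applied per layer and aggregated across layers, which is what makes this style of detector architecture-agnostic: it only consumes activation tensors, regardless of the model or input modality.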