🤖 AI Summary
Deep learning models deployed in safety-critical applications are vulnerable to adversarial attacks, yet effective unsupervised detection methods—especially those not requiring adversarial examples for training—remain scarce. To address this, we propose U-CAN, an unsupervised adversarial detection framework that operates without access to adversarial samples during training. U-CAN embeds lightweight contrastive auxiliary networks into intermediate layers of the target model; trained with an ArcFace loss, these networks project features into a contrastive space where class-discriminative boundaries are implicitly modeled, enabling robust identification of anomalous feature behavior. The framework is plug-and-play and architecture-agnostic, supporting ResNet, VGG, and ViT backbones. Evaluated on CIFAR-10, Mammals, and an ImageNet subset against four attack types (including FGSM and PGD), U-CAN achieves state-of-the-art F1 scores among unsupervised methods—marking the first instance of high-accuracy, cross-architecture adversarial detection without reliance on adversarial training data.
📝 Abstract
Deep learning models are widely employed in safety-critical applications yet remain susceptible to adversarial attacks -- imperceptible perturbations that can significantly degrade model performance. Conventional defense mechanisms predominantly focus on either enhancing model robustness or detecting adversarial inputs, treating the two as independent problems. In this work, we propose Unsupervised adversarial detection via Contrastive Auxiliary Networks (U-CAN), which uncovers adversarial behavior within auxiliary feature representations without requiring adversarial examples. U-CAN attaches auxiliary networks to selected intermediate layers of the target model. These networks, each comprising a projection layer and an ArcFace-based linear layer, refine feature representations to more effectively distinguish between benign and adversarial inputs. Comprehensive experiments across multiple datasets (CIFAR-10, Mammals, and a subset of ImageNet) and architectures (ResNet-50, VGG-16, and ViT) demonstrate that our method surpasses existing unsupervised adversarial detection techniques, achieving superior F1 scores against four distinct attack methods. The proposed framework provides a scalable and effective solution for enhancing the security and reliability of deep learning systems.
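The ArcFace-based auxiliary head described above can be sketched in a few lines. The following is a minimal NumPy illustration, not the paper's implementation: the feature dimensions, projection, margin `m`, and scale `s` are illustrative assumptions. It shows the core ArcFace idea the auxiliary networks rely on — L2-normalizing embeddings and class weights, then adding an angular margin to the ground-truth class before scaling the cosine logits.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Project vectors onto the unit hypersphere (standard ArcFace preprocessing).
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def arcface_logits(features, weights, labels, s=30.0, m=0.5):
    """Cosine logits with an additive angular margin m on each sample's
    true class, scaled by s. (Production implementations also guard the
    theta + m > pi corner case, omitted here for brevity.)"""
    emb = l2_normalize(features)             # (N, D) unit embeddings
    w = l2_normalize(weights, axis=0)        # (D, C) unit class weights
    cos = emb @ w                            # (N, C) cosine similarities
    theta = np.arccos(np.clip(cos, -1.0, 1.0))
    one_hot = np.zeros_like(cos)
    one_hot[np.arange(len(labels)), labels] = 1.0
    # Margin is added only to the angle of the ground-truth class.
    return s * np.cos(theta + m * one_hot)

# Toy auxiliary head: project a hypothetical 512-d intermediate feature
# map to a 64-d contrastive space, then score against 10 classes.
rng = np.random.default_rng(0)
proj = rng.normal(size=(512, 64))            # projection layer (illustrative)
W = rng.normal(size=(64, 10))                # ArcFace class weights
feats = rng.normal(size=(4, 512)) @ proj     # 4 intermediate feature vectors
logits = arcface_logits(feats, W, labels=np.array([0, 1, 2, 3]))
```

Because the margin shrinks the true-class logit during training, the network is pushed to separate classes by angle; at detection time, adversarial inputs tend to produce embeddings that sit anomalously relative to these angular class boundaries.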