🤖 AI Summary
This work proposes NERO-Net, a neuroevolutionary approach that searches for intrinsically robust architectures to address the vulnerability of neural networks to adversarial attacks. Unlike conventional methods that rely on adversarial training, NERO-Net trains every candidate with a standard procedure during the search and uses both clean accuracy and FGSM robust accuracy as a composite fitness function, steering the search toward CNN architectures that are inherently robust to ℓ∞-bounded perturbations. On CIFAR-10, the fittest architecture leaves the search with 87% clean accuracy and 33% FGSM robust accuracy; further standard training, still without any adversarial training, raises these to 93% and 47%. After subsequent adversarial training, the model reaches 40% accuracy against the stronger AutoAttack benchmark. These results suggest that the architectural design itself contributes to model robustness.
📝 Abstract
Neuroevolution automates the complex task of neural network design but often ignores the inherent adversarial fragility of evolved models, which is a barrier to adoption in safety-critical scenarios. While robust training methods have received significant attention, the design of architectures exhibiting intrinsic robustness remains largely unexplored. In this paper, we propose NERO-Net, a neuroevolutionary approach to designing convolutional neural networks that are better equipped to resist adversarial attacks. Our search strategy isolates the architectural influence on robustness by avoiding adversarial training during the evolutionary loop. As such, our fitness function promotes candidates that, even when trained with standard (non-robust) methods, achieve high post-attack accuracy without sacrificing accuracy on clean samples. We assess NERO-Net on CIFAR-10 with a specific focus on $L_\infty$-robustness. The fittest individual emerged from the evolutionary search with 33% accuracy against FGSM, which serves as an efficient estimator of robustness during the search phase, while maintaining 87% clean accuracy. Further standard training of this individual raised these metrics to 47% adversarial and 93% clean accuracy, suggesting inherent architectural robustness. Adversarial training then brings the accuracy of the model to 40% against AutoAttack.
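To make the search-time evaluation concrete, the following is a minimal sketch of a one-step FGSM attack under an $L_\infty$ budget, together with a fitness score that combines clean and post-attack accuracy. It is an illustration under stated assumptions, not the paper's implementation: a logistic-regression model stands in for an evolved CNN so that the input gradient has a closed form, and the weighted sum in `fitness`, its `alpha` parameter, and all function names are hypothetical, since the abstract does not specify how the two accuracies are combined.

```python
import math

def logit(w, b, x):
    # Linear score of the stand-in model (hypothetical; the paper evolves CNNs).
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, b, x):
    return 1 if logit(w, b, x) > 0 else 0

def fgsm(w, b, x, y, eps):
    """One-step FGSM: x_adv = clip(x + eps * sign(dLoss/dx), 0, 1)."""
    p = 1.0 / (1.0 + math.exp(-logit(w, b, x)))
    # For binary cross-entropy on a logistic model, dLoss/dx = (p - y) * w.
    grad = [(p - y) * wi for wi in w]
    sign = lambda g: (g > 0) - (g < 0)
    return [min(1.0, max(0.0, xi + eps * sign(g))) for xi, g in zip(x, grad)]

def accuracy(w, b, xs, ys):
    return sum(predict(w, b, x) == y for x, y in zip(xs, ys)) / len(xs)

def robust_accuracy(w, b, xs, ys, eps):
    # Accuracy on FGSM-perturbed copies of the inputs.
    adv = [fgsm(w, b, x, y, eps) for x, y in zip(xs, ys)]
    return accuracy(w, b, adv, ys)

def fitness(clean_acc, robust_acc, alpha=0.5):
    # Assumed weighted sum; the actual composite used by NERO-Net is not given here.
    return alpha * clean_acc + (1.0 - alpha) * robust_acc

# Toy usage: two points, one near the decision boundary and one far from it.
w, b = [1.0, -1.0], 0.0
xs, ys = [[0.6, 0.4], [0.9, 0.1]], [1, 1]
clean = accuracy(w, b, xs, ys)
robust = robust_accuracy(w, b, xs, ys, eps=0.2)
score = fitness(clean, robust)
```

In this toy setup the point near the boundary is flipped by the perturbation while the distant one survives, so the robust accuracy falls below the clean accuracy and the fitness score reflects both. In a real search loop each candidate would first be trained with standard (non-adversarial) training, and the input gradient would come from automatic differentiation rather than a closed form.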