🤖 AI Summary
Deep learning systems exhibit insufficient robustness in safety-critical applications, and existing test input generators (TIGs) lack systematic, cross-dimensional evaluation across data-complexity levels. Method: This work presents a comprehensive empirical assessment of four state-of-the-art TIGs—DeepHunter, DeepFault, AdvGAN, and SinVAD—across four dimensions: defect detection capability, naturalness, diversity, and efficiency. Experiments span three datasets (MNIST, CIFAR-10, ImageNet-1K) and three models (LeNet-5, VGG16, EfficientNetB3), employing black-box and white-box testing frameworks, adversarial generation techniques, quantitative metrics (e.g., LPIPS, FID), and multi-scale robustness validation. Results: TIG performance degrades significantly as data complexity increases; DeepHunter demonstrates the most robust fault detection, SinVAD achieves the best trade-off between naturalness and efficiency, and AdvGAN suffers pronounced quality degradation in high-dimensional settings. The study proposes a principled TIG selection guideline for safety-critical applications, thereby filling a critical gap in the evaluation of robustness-testing tools.
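To make the naturalness evaluation concrete, the sketch below implements the Fréchet Inception Distance (FID), one of the quantitative metrics named above. This is a minimal, self-contained illustration: in practice the feature vectors are Inception-v3 activations extracted from real and TIG-generated images, whereas here small random arrays serve as hypothetical stand-ins, and the `fid` helper is our own name, not from any of the tools studied.

```python
import numpy as np

def _sqrtm_psd(a):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    w, v = np.linalg.eigh(a)
    w = np.clip(w, 0.0, None)  # clip tiny negative eigenvalues from numerical noise
    return (v * np.sqrt(w)) @ v.T

def fid(feats_real, feats_gen):
    """FID = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2))."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    s_r = np.cov(feats_real, rowvar=False)
    s_g = np.cov(feats_gen, rowvar=False)
    # Tr((S_r S_g)^(1/2)) computed via the equivalent symmetric form
    # S_r^(1/2) S_g S_r^(1/2), which keeps everything real and PSD.
    s_r_half = _sqrtm_psd(s_r)
    tr_covmean = np.trace(_sqrtm_psd(s_r_half @ s_g @ s_r_half))
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(s_r) + np.trace(s_g) - 2.0 * tr_covmean)

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(500, 8))  # stand-in "real" features (hypothetical)
fake = rng.normal(0.5, 1.0, size=(500, 8))  # stand-in "generated" features, mean-shifted
print(f"FID(real, real):    {fid(real, real):.4f}")
print(f"FID(real, shifted): {fid(real, fake):.4f}")
```

A lower FID indicates the generated inputs are distributionally closer to the real data; identical feature sets yield a score near zero, while the mean-shifted set scores noticeably higher.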
📝 Abstract
Deep Learning (DL) systems are increasingly deployed in safety-critical applications, yet they remain vulnerable to robustness issues that can lead to significant failures. While numerous Test Input Generators (TIGs) have been developed to evaluate DL robustness, a comprehensive assessment of their effectiveness across different dimensions is still lacking. This paper presents such an assessment for four state-of-the-art TIGs (DeepHunter, DeepFault, AdvGAN, and SinVAD) across multiple critical aspects: fault-revealing capability, naturalness, diversity, and efficiency. Our empirical study leverages three pre-trained models (LeNet-5, VGG16, and EfficientNetB3) on datasets of varying complexity (MNIST, CIFAR-10, and ImageNet-1K) to evaluate TIG performance. Our findings reveal important trade-offs across TIGs in fault-revealing capability, variation in generated test cases, and computational efficiency. The results also show that TIG performance varies significantly with dataset complexity: tools that perform well on simpler datasets may struggle with more complex ones, while others maintain steadier performance or scale better. This paper offers practical guidance for selecting TIGs suited to specific testing objectives and dataset characteristics. Nonetheless, more work is needed to address current TIG limitations and advance TIGs toward real-world, safety-critical systems.