Towards Assessing Deep Learning Test Input Generators

📅 2025-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deep learning systems exhibit insufficient robustness in safety-critical applications, and existing test input generators (TIGs) lack systematic, cross-dimensional evaluation across data complexity levels. Method: This work presents the first comprehensive empirical assessment of four state-of-the-art TIGs—DeepHunter, DeepFault, AdvGAN, and SinVAD—across four dimensions: defect detection capability, naturalness, diversity, and efficiency. Experiments span three datasets (MNIST, CIFAR-10, ImageNet-1K) and three models (LeNet-5, VGG16, EfficientNetB3), employing black-box/white-box testing frameworks, adversarial generation techniques, quantitative metrics (e.g., LPIPS, FID), and multi-scale robustness validation. Results: TIG performance degrades significantly with increasing data complexity; DeepHunter demonstrates the most robust fault detection, SinVAD achieves the best trade-off between naturalness and efficiency, and AdvGAN suffers pronounced quality degradation in high-dimensional settings. The study proposes a principled TIG selection guideline for safety-critical applications, thereby filling a critical gap in the evaluation framework for robustness-testing tools.
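As a concrete illustration of one of the naturalness metrics named above, FID (Fréchet Inception Distance) compares the mean and covariance of feature embeddings of real versus generated inputs. The sketch below assumes feature vectors have already been extracted (in practice via an Inception-v3 pooling layer; the paper's exact extractor and settings are not specified here) and uses NumPy/SciPy:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_inception_distance(feats_real, feats_gen):
    """FID between two (N, D) arrays of feature embeddings.

    FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2*(S1 @ S2)^(1/2))
    where mu/S are the per-set mean and covariance.
    """
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical noise
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

Lower scores indicate generated test inputs whose feature statistics are closer to the real data distribution, i.e. more "natural" test cases; identical sets score (numerically) zero.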

📝 Abstract
Deep Learning (DL) systems are increasingly deployed in safety-critical applications, yet they remain vulnerable to robustness issues that can lead to significant failures. While numerous Test Input Generators (TIGs) have been developed to evaluate DL robustness, a comprehensive assessment of their effectiveness across different dimensions is still lacking. This paper presents a comprehensive assessment of four state-of-the-art TIGs--DeepHunter, DeepFault, AdvGAN, and SinVAD--across multiple critical aspects: fault-revealing capability, naturalness, diversity, and efficiency. Our empirical study leverages three pre-trained models (LeNet-5, VGG16, and EfficientNetB3) on datasets of varying complexity (MNIST, CIFAR-10, and ImageNet-1K) to evaluate TIG performance. Our findings reveal important trade-offs across TIGs in fault-revealing capability, variation in test case generation, and computational efficiency. The results also show that TIG performance varies significantly with dataset complexity: tools that perform well on simpler datasets may struggle with more complex ones, while others maintain steadier performance or scale better. This paper offers practical guidance for selecting appropriate TIGs aligned with specific objectives and dataset characteristics. Nonetheless, more work is needed to address TIG limitations and advance TIGs for real-world, safety-critical systems.
Problem

Research questions and friction points this paper is trying to address.

Assessing effectiveness of DL test input generators
Evaluating TIGs across fault-revealing and efficiency metrics
Analyzing TIG performance variation with dataset complexity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive assessment of four TIGs
Evaluates fault-revealing, naturalness, diversity, efficiency
Guidance for TIG selection based on objectives and dataset characteristics