🤖 AI Summary
Existing statistical mutation-killing criteria, such as DeepCrime's, violate monotonicity: expanding the test set may overturn a prior "killed" verdict to "survived", undermining the reliability and interpretability of DNN mutation testing.
Method: This paper reformulates the statistical killing criterion using Fisher's exact test, bringing rigorous statistical hypothesis testing to DNN mutation testing in a way that guarantees monotonicity. It models output-behaviour discrepancies between the original and mutated models and strictly controls the Type-I error rate.
Contribution/Results: The proposed method guarantees that once a mutant is declared "killed", it remains killed under any superset of the test data, fundamentally resolving the non-monotonicity issue while preserving statistical rigour. Empirical evaluation on CIFAR-10 and ImageNet demonstrates significantly improved stability and reproducibility of mutant detection. The approach offers a theoretically sound and practically viable paradigm for assessing the effectiveness of DNN test suites.
📝 Abstract
Mutation testing has emerged as a powerful technique for evaluating the effectiveness of test suites for Deep Neural Networks. Among existing approaches, DeepCrime's statistical mutant-killing criterion leverages statistical testing to determine whether a mutant behaves significantly differently from the original model. However, it suffers from a critical limitation: it violates the monotonicity property, meaning that expanding a test set may cause previously killed mutants to no longer be classified as killed. In this technical report, we propose a new formulation of statistical mutant killing based on Fisher's exact test that preserves its statistical rigour while ensuring monotonicity.
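To make the idea concrete, the sketch below shows how a mutant-killing decision could be phrased as a one-sided Fisher's exact test on a 2x2 contingency table of misclassification counts for the original and mutated models. This is an illustrative assumption, not the report's exact formulation: the function name `fisher_one_sided`, the choice of table layout, and the 0.05 threshold are all hypothetical; the p-value is computed from the hypergeometric tail using only the standard library.

```python
from math import comb

def fisher_one_sided(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test p-value for the 2x2 table
    [[a, b], [c, d]], e.g. a = mutant errors, b = mutant correct,
    c = original errors, d = original correct. Returns the probability,
    under fixed margins, of a table at least as extreme as observed
    (top-left cell >= a), i.e. the hypergeometric upper tail."""
    row1 = a + b          # mutant's test-set size
    col1 = a + c          # total errors across both models
    n = a + b + c + d     # grand total
    denom = comb(n, row1)
    k_max = min(row1, col1)
    # Sum the hypergeometric pmf over all tables at least as extreme.
    # math.comb(m, k) returns 0 when k > m, so out-of-range terms vanish.
    return sum(
        comb(col1, k) * comb(n - col1, row1 - k)
        for k in range(a, k_max + 1)
    ) / denom

def is_killed(mut_err: int, mut_ok: int, orig_err: int, orig_ok: int,
              alpha: float = 0.05) -> bool:
    """Declare a mutant killed if it misclassifies significantly
    more often than the original model at level alpha (hypothetical
    decision rule for illustration)."""
    return fisher_one_sided(mut_err, mut_ok, orig_err, orig_ok) < alpha

# A mutant with 20/100 errors vs. an original with 5/100 errors
# shows a significant discrepancy; identical error rates do not.
print(is_killed(20, 80, 5, 95))   # significant discrepancy
print(is_killed(5, 95, 5, 95))    # no discrepancy
```

Because the decision is a single hypothesis test with a controlled Type-I error rate, the framework can then define killing so that a verdict reached on one test set is not retracted when more data is added, which is the monotonicity property the report targets.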