An Empirical Study of the Realism of Mutants in Deep Learning

📅 2025-12-18

🤖 AI Summary

This paper presents the first empirical validation of the realism hypothesis underlying mutation analysis in deep learning: that mutants behave similarly to real faults. The study evaluates pre-training and post-training mutation approaches on four public bug datasets (CleanML, DeepFD, DeepLocalize, and defect4ML) and introduces a statistical framework that quantifies the coupling strength and behavioral similarity between mutants and real faults. The contributions are threefold: (1) the first systematic empirical test of the realism hypothesis in deep learning mutation testing; (2) a quantitative evaluation framework grounded in statistical metrics; and (3) experimental evidence that pre-training mutants show consistently stronger coupling and higher behavioral similarity to real faults than post-training mutants, which indicates greater realism and points to the need for better post-training operators.

📝 Abstract

Mutation analysis is a well-established technique for assessing test quality in the traditional software development paradigm by injecting artificial faults into programs. Its application to deep learning (DL) has expanded beyond classical testing to support tasks such as fault localization, repair, data generation, and model robustness evaluation. The core assumption is that mutants behave similarly to real faults, an assumption well established in traditional software systems but largely unverified for DL. This study presents the first empirical comparison of pre-training and post-training mutation approaches in DL with respect to realism. We introduce a statistical framework to quantify their coupling strength and behavioral similarity to real faults using publicly available bug datasets: CleanML, DeepFD, DeepLocalize, and defect4ML. Mutants are generated using state-of-the-art tools representing both approaches. Results show that pre-training mutants exhibit consistently stronger coupling and higher behavioral similarity to real faults than post-training mutants, indicating greater realism. However, the substantial computational cost of pre-training mutation underscores the need for more effective post-training operators that match or exceed the realism demonstrated by pre-training mutants.
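
The abstract does not spell out how coupling strength and behavioral similarity are computed, so the following is only an illustrative sketch under stated assumptions, not the paper's framework. It assumes each mutant and each real fault is summarized by the set of test inputs that expose it, treats coupling as the share of mutant-killing inputs that also reveal the real fault, and uses Jaccard overlap as a stand-in for behavioral similarity. The function names and the input identifiers are hypothetical.

```python
from typing import Set


def coupling_strength(mutant_kills: Set[str], fault_reveals: Set[str]) -> float:
    """Assumed definition: fraction of inputs that kill the mutant
    and also reveal the real fault."""
    if not mutant_kills:
        return 0.0  # a mutant that is never killed gives no coupling evidence
    return len(mutant_kills & fault_reveals) / len(mutant_kills)


def behavioral_similarity(mutant_kills: Set[str], fault_reveals: Set[str]) -> float:
    """Assumed definition: Jaccard overlap of the two sets of exposing inputs."""
    union = mutant_kills | fault_reveals
    if not union:
        return 1.0  # both behave identically to the original on every input
    return len(mutant_kills & fault_reveals) / len(union)


# Hypothetical input identifiers, for illustration only.
mutant_kills = {"x1", "x2", "x3"}
fault_reveals = {"x2", "x3", "x4", "x5"}

print(coupling_strength(mutant_kills, fault_reveals))      # 2 of 3 killing inputs overlap
print(behavioral_similarity(mutant_kills, fault_reveals))  # 2 shared out of 5 distinct inputs
```

Under these assumptions, a mutant scores high on both measures only when the inputs that expose it are largely the same inputs that expose the real fault.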

Problem

Research questions and friction points this paper is trying to address.

Assesses realism of mutants in deep learning testing

Compares pre-training vs post-training mutation approaches empirically

Quantifies coupling and behavioral similarity to real faults
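
To make the pre-training versus post-training distinction concrete, here is a minimal sketch of one operator of each kind. It assumes a pre-training mutant is created by altering the training inputs (here, flipping a fraction of binary labels) so that the model must be retrained, while a post-training mutant is created by directly perturbing the weights of an already trained model. Both operators and their names are hypothetical illustrations, not the operators of the tools evaluated in the paper.

```python
import random
from typing import List


def flip_labels(labels: List[int], fraction: float, rng: random.Random) -> List[int]:
    """Pre-training mutation sketch: flip a fraction of binary training labels.
    The model would then be retrained on the mutated labels."""
    mutated = list(labels)
    k = int(len(labels) * fraction)
    for i in rng.sample(range(len(labels)), k):
        mutated[i] = 1 - mutated[i]
    return mutated


def perturb_weights(weights: List[float], sigma: float, rng: random.Random) -> List[float]:
    """Post-training mutation sketch: add Gaussian noise to trained weights.
    No retraining is needed, which is why this kind of mutant is cheap."""
    return [w + rng.gauss(0.0, sigma) for w in weights]


rng = random.Random(0)  # fixed seed so the sketch is repeatable

labels = [0, 1, 0, 1, 1, 0, 1, 0]
mutated_labels = flip_labels(labels, fraction=0.25, rng=rng)
print(sum(a != b for a, b in zip(labels, mutated_labels)))  # number of flipped labels

weights = [0.5, -1.2, 0.3]
mutated_weights = perturb_weights(weights, sigma=0.1, rng=rng)
print(len(mutated_weights))  # same shape as the original weights
```

The cost asymmetry follows from this structure: every pre-training mutant requires a full training run, whereas a post-training mutant only requires rewriting stored parameters.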

Innovation

Methods, ideas, or system contributions that make the work stand out.

Pre-training mutants show stronger coupling to real faults

Statistical framework quantifies mutant realism in deep learning

Motivates post-training operators that match pre-training realism, given the high computational cost of pre-training mutation
