Deep Learning Framework Testing via Model Mutation: How Far Are We?

📅 2025-06-21
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Deep learning (DL) framework defect detection faces several challenges: mutation testing has low specificity, producing numerous invalid mutations and high false-positive rates, and even the defects it does detect are frequently ignored by developers. This paper addresses these issues by constructing, for the first time, a developer-validated dataset of critical defects annotated with priority ratings, enabling systematic evaluation of existing model mutation techniques. We identify the root cause of excessive false positives as the indiscriminate application of generic mutation operators, and propose a framework-aware, customized mutation strategy. Through an empirical study integrating model mutation testing, log analysis, defect classification, and manual validation across 23 mainstream DL models, we uncover 39 unique defects (31 confirmed, 8 fixed). Our optimized approach further identifies seven new defects, including four high-priority ones, three of which have already been resolved, significantly improving the detection of critical defects and the practical impact of mutation-based testing.
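The false-positive mechanism the summary describes is easy to see in miniature: a generic operator that perturbs model weights indiscriminately can push a model into numerically invalid states, and any downstream "inconsistency" then reflects the mutation itself rather than a framework defect. The sketch below illustrates one such operator and a simple legality filter. PyTorch, the noise operator, and the finiteness check are illustrative assumptions, not the paper's actual operators.

```python
# Minimal sketch of a generic ("non-customized") weight-mutation operator and a
# legality filter. Illustrative only; not the paper's implementation.
import copy
import torch
import torch.nn as nn

def mutate_weights(model: nn.Module, scale: float = 10.0) -> nn.Module:
    """Generic operator: add large Gaussian noise to every parameter."""
    mutant = copy.deepcopy(model)
    with torch.no_grad():
        for p in mutant.parameters():
            p.add_(torch.randn_like(p) * scale)
    return mutant

def is_legal_mutant(mutant: nn.Module, sample: torch.Tensor) -> bool:
    """Reject mutants whose outputs are already NaN/Inf: any inconsistency
    they trigger later reflects the mutation, not the framework."""
    mutant.eval()
    with torch.no_grad():
        out = mutant(sample)
    return bool(torch.isfinite(out).all())

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
mutant = mutate_weights(model)
print("legal mutant:", is_legal_mutant(mutant, torch.randn(4, 8)))
```

A framework-aware strategy in this spirit would constrain operators so that mutants stay within the framework's valid input domain, rather than filtering invalid ones after the fact.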

📝 Abstract
Deep Learning (DL) frameworks are a fundamental component of DL development. Therefore, the detection of DL framework defects is important and challenging. As one of the most widely adopted DL testing techniques, model mutation has recently gained significant attention. In this study, we revisit the defect detection ability of existing mutation-based testing methods and investigate the factors that influence their effectiveness. To begin with, we reviewed existing methods and observed that many of them mutate DL models (e.g., changing their parameters) without any customization, ignoring the unique challenges in framework testing. Another issue with these methods is their limited effectiveness, characterized by a high rate of false positives caused by illegal mutations arising from the use of generic, non-customized mutation operators. Moreover, we tracked the defects identified by these methods and discovered that most of them were ignored by developers. Motivated by these observations, we investigate the effectiveness of existing mutation-based testing methods in detecting important defects that have been authenticated by framework developers. We begin by collecting defect reports from three popular frameworks and classifying them based on framework developers' ratings to build a comprehensive dataset. We then perform an in-depth analysis to uncover valuable insights. Based on our findings, we propose optimization strategies to address the shortcomings of existing approaches. Following these optimizations, we identified seven new defects, four of which were confirmed by developers as high-priority issues, with three resolved. In summary, we identified 39 unique defects across just 23 models, of which 31 were confirmed by developers, and eight have been fixed.
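For context on how mutated models surface framework defects, mutation-based framework testing typically relies on a differential oracle: the same model is executed under multiple frameworks or backends, and disagreements beyond a tolerance are flagged for triage. The minimal sketch below illustrates that oracle using two execution backends of a single framework (PyTorch eager mode vs. TorchScript) so it stays self-contained; the paper itself studies three frameworks, and the tolerance here is an assumed value.

```python
# Sketch of a differential test oracle for framework testing (illustrative).
import torch
import torch.nn as nn

def differential_check(model: nn.Module, sample: torch.Tensor,
                       atol: float = 1e-5) -> bool:
    """Return True if eager execution and the TorchScript backend agree."""
    model.eval()
    scripted = torch.jit.trace(model, sample)  # backend 2: traced graph
    with torch.no_grad():
        eager_out = model(sample)              # backend 1: eager mode
        traced_out = scripted(sample)
    return torch.allclose(eager_out, traced_out, atol=atol)

# A small model standing in for the paper's 23 subject models.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3), nn.ReLU(),
    nn.Flatten(), nn.Linear(8 * 14 * 14, 10),
)
x = torch.randn(1, 3, 16, 16)
if not differential_check(model, x):
    print("Backend disagreement: candidate framework defect, report for triage")
```

Legal mutants that still trigger disagreements are the interesting cases; the paper's contribution is ensuring those disagreements point to defects developers actually rate as important.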
Problem

Research questions and friction points this paper is trying to address.

Evaluating defect detection in DL framework testing
Addressing high false positives in mutation-based methods
Improving mutation strategies for developer-validated defects
Innovation

Methods, ideas, or system contributions that make the work stand out.

Customized model mutation for DL framework testing
Optimized mutation operators to reduce false positives
Developer-verified defect classification and prioritization
👥 Authors

Yanzhou Mu
Nanjing University
deep learning testing, SE4AI, concurrency testing, software defect prediction

Rong Wang
School of Information Science and Technology, Nantong University, China

Juan Zhai
University of Massachusetts, Amherst
software text analytics, software reliability, deep learning

Chunrong Fang
Software Institute, Nanjing University
Software Testing, Software Engineering, Computer Science

Xiang Chen
School of Artificial Intelligence and Computer Science, Nantong University, China

Zhiyuan Peng
State Key Laboratory for Novel Software Technology, Nanjing University, China

Peiran Yang
Nanjing University
AI4SE

Ruixiang Qian
Nanjing University
Fuzzing, Software Testing, Program Analysis

Shaoyu Yang
Nanjing University
AI Infra, Fuzz Testing, Large Language Models, Code Intelligence, Mining Software Repositories

Zhenyu Chen
State Key Laboratory for Novel Software Technology, Nanjing University, China