🤖 AI Summary
This study investigates how hyperparameter tuning affects inner-version (IVDP) versus cross-version (CVDP) software defect prediction (SDP), challenging the implicit assumption that tuning strategies are universally effective across SDP scenarios. Method: We conduct a large-scale empirical study across 53 datasets, evaluating 28 machine learning algorithms under two tuning methods and five optimization metrics, supported by rigorous statistical analysis. Contribution/Results: Results show significantly larger performance gains from tuning in IVDP than in CVDP; small datasets exhibit higher sensitivity to hyperparameters; and tuning benefits for up to 24 of the 28 algorithms do not hold across scenarios. This is the first empirical demonstration that tuning efficacy is strongly scenario-dependent. Consequently, we advocate tailoring the evaluation and selection of tuning strategies to the specific SDP context, IVDP or CVDP, rather than to the algorithm alone. Our findings provide empirical grounding and methodological guidance for enhancing model robustness, generalizability, and practical utility in SDP.
📝 Abstract
Software defect prediction (SDP) is crucial for delivering high-quality software products. Recent research indicates that hyperparameter tuning can improve prediction performance within a particular SDP scenario. However, the gains from the tuning step may differ depending on the targeted SDP scenario. Comparing the impact of hyperparameter tuning across SDP scenarios is therefore necessary to provide comprehensive insights and to enhance the robustness, generalizability, and ultimately the practicality of SDP modeling for quality assurance.
Therefore, in this study, we contrast the impact of hyperparameter tuning across two pivotal and consecutive SDP scenarios: (1) Inner Version Defect Prediction (IVDP) and (2) Cross Version Defect Prediction (CVDP). The two scenarios differ mainly in the scope of defect prediction and the evaluation setups they require. Our experiments use common evaluation setups, 28 machine learning (ML) algorithms, 53 post-release software datasets, two tuning algorithms, and five optimization metrics. We apply statistical analyses to compare the differences in SDP performance impact, examining the overall impact, the impact on each individual ML algorithm, and variations across software dataset sizes.
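To make the IVDP/CVDP distinction concrete, the following is a minimal sketch, not the paper's actual protocol: it uses synthetic data, a single illustrative algorithm (a decision tree), one tuning method (scikit-learn's `GridSearchCV`), and one optimization metric (AUC), all of which are assumptions for illustration. IVDP tunes and tests within one software version; CVDP tunes on version N and evaluates on version N+1.

```python
# Hedged sketch of IVDP vs CVDP tuning setups (synthetic data, one algorithm).
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def make_version(n, shift=0.0):
    # Toy "release": metric features plus a defect label; `shift` mimics
    # distribution drift between consecutive software versions.
    X = rng.normal(shift, 1.0, size=(n, 5))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, n) > 0).astype(int)
    return X, y

X_v1, y_v1 = make_version(400)        # version N
X_v2, y_v2 = make_version(400, 0.3)   # version N+1 (drifted)

grid = {"max_depth": [2, 4, 8, None], "min_samples_leaf": [1, 5, 20]}

# IVDP: tune and evaluate within the same version (held-out split of N).
X_tr, X_te, y_tr, y_te = train_test_split(X_v1, y_v1, random_state=0)
ivdp = GridSearchCV(DecisionTreeClassifier(random_state=0), grid,
                    scoring="roc_auc", cv=5).fit(X_tr, y_tr)
ivdp_auc = roc_auc_score(y_te, ivdp.predict_proba(X_te)[:, 1])

# CVDP: tune on version N, evaluate on the next version N+1.
cvdp = GridSearchCV(DecisionTreeClassifier(random_state=0), grid,
                    scoring="roc_auc", cv=5).fit(X_v1, y_v1)
cvdp_auc = roc_auc_score(y_v2, cvdp.predict_proba(X_v2)[:, 1])

print(f"IVDP AUC: {ivdp_auc:.3f}  CVDP AUC: {cvdp_auc:.3f}")
```

Because the CVDP test set comes from a different (drifted) version than the one the hyperparameters were tuned on, tuning gains observed in the IVDP split need not carry over, which is the contrast the study measures at scale.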
The results indicate that the SDP performance gains in the IVDP scenario are significantly larger than those in the CVDP scenario. They also reveal that claimed performance gains for up to 24 of the 28 ML algorithms may not hold across multiple SDP scenarios. Furthermore, we found that small software datasets are more susceptible to larger differences in performance impact. Overall, the findings recommend that software engineering researchers and practitioners consider the effect of the selected SDP scenario when expecting performance gains from hyperparameter tuning.