🤖 AI Summary
Software defect prediction (SDP) models are frequently degraded by several data quality issues occurring at once, yet prior work typically examines these issues in isolation. Method: We conduct what is, to our knowledge, the first large-scale empirical study of this problem, covering 374 datasets and five classifiers and systematically analyzing co-occurrence patterns and interaction effects among five key issues: class imbalance, class overlap, irrelevant features, attribute noise, and outliers. Contribution/Results: We find that co-occurrence is nearly universal, with even the least frequent issue (attribute noise) appearing alongside others in more than 93% of datasets; class overlap above roughly 0.20 marks the point at which most models begin to degrade; outliers, counterintuitively, improve performance when the level of irrelevant features is low; and no single model is robust across all issue combinations. Using interpretable Explainable Boosting Machines (EBM) and stratified interaction analysis under default hyperparameter settings, we expose a performance–robustness trade-off. This work provides an empirical benchmark and a diagnostic framework for data-aware SDP modeling.
📝 Abstract
Software Defect Prediction (SDP) models are central to proactive software quality assurance, yet their effectiveness is often constrained by the quality of available datasets. Prior research has typically examined single issues such as class imbalance or feature irrelevance in isolation, overlooking that real-world data problems frequently co-occur and interact. This study presents, to our knowledge, the first large-scale empirical analysis in SDP that simultaneously examines five co-occurring data quality issues (class imbalance, class overlap, irrelevant features, attribute noise, and outliers) across 374 datasets and five classifiers. We employ Explainable Boosting Machines together with stratified interaction analysis to quantify both direct and conditional effects under default hyperparameter settings, reflecting practical baseline usage.
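To make the idea of a conditional effect concrete, the following is a minimal sketch of a stratified comparison, not the study's actual pipeline. It assumes each dataset is summarized by issue levels scaled to [0, 1] and a single performance score; the function name `conditional_effect`, the field names, the 0.5 cut-off, and the toy rows are all hypothetical and chosen only for illustration.

```python
from statistics import mean


def conditional_effect(rows, issue, context, metric="score", cut=0.5):
    """Mean change in `metric` when `issue` is high vs. low,
    computed separately within the low and high strata of `context`.

    `cut` is an assumed split point, not a value from the study.
    """
    effects = {}
    for stratum in ("low", "high"):
        # Keep only the rows that fall in this stratum of the context issue.
        in_stratum = [
            r for r in rows if (r[context] >= cut) == (stratum == "high")
        ]
        high = [r[metric] for r in in_stratum if r[issue] >= cut]
        low = [r[metric] for r in in_stratum if r[issue] < cut]
        # An effect is only defined when both groups are non-empty.
        effects[stratum] = mean(high) - mean(low) if high and low else None
    return effects


# Toy rows with made-up values (not results from the study).
rows = [
    {"outliers": 0.1, "irrelevance": 0.2, "score": 0.25},
    {"outliers": 0.8, "irrelevance": 0.2, "score": 0.5},
    {"outliers": 0.1, "irrelevance": 0.9, "score": 0.5},
    {"outliers": 0.8, "irrelevance": 0.9, "score": 0.25},
]

effects = conditional_effect(rows, issue="outliers", context="irrelevance")
```

In these toy rows, the effect of outliers changes sign between the two strata of irrelevance. A pooled, unstratified comparison would average those two effects away and hide the interaction.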
Our results show that co-occurrence is nearly universal: even the least frequent issue (attribute noise) appears alongside others in more than 93% of datasets. Irrelevant features and imbalance are nearly ubiquitous, while class overlap is the most consistently harmful issue. We identify stable tipping points around 0.20 for class overlap, 0.65-0.70 for imbalance, and 0.94 for irrelevance, beyond which most models begin to degrade. We also uncover counterintuitive patterns, such as outliers improving performance when the level of irrelevant features is low, underscoring the importance of context-aware evaluation. Finally, we expose a performance-robustness trade-off: no single learner dominates under all conditions.
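The reported tipping points lend themselves to a simple diagnostic check. The sketch below is a hedged illustration only: the abstract does not define how each issue is measured, so it assumes every measurement is on the same scale as the reported thresholds, with higher values meaning a more severe issue. The names `TIPPING_POINTS` and `flag_issues` are hypothetical, and the lower end of the reported 0.65-0.70 imbalance band is used as a conservative cut-off.

```python
# Thresholds taken from the abstract; the measurement scales are assumptions.
TIPPING_POINTS = {
    "class_overlap": 0.20,
    "imbalance": 0.65,  # lower end of the reported 0.65-0.70 band
    "irrelevance": 0.94,
}


def flag_issues(measurements):
    """Return, in alphabetical order, the issues whose measured level
    exceeds the corresponding tipping point. Missing issues are skipped."""
    return sorted(
        name
        for name, limit in TIPPING_POINTS.items()
        if name in measurements and measurements[name] > limit
    )


# Hypothetical dataset profile, for illustration only.
flags = flag_issues(
    {"class_overlap": 0.25, "imbalance": 0.5, "irrelevance": 0.95}
)
```

Here `flags` lists class overlap and irrelevance but not imbalance. A check like this only screens for issues one at a time. It does not capture the conditional effects the study emphasizes, so it is a starting point rather than a substitute for context-aware evaluation.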
By jointly analyzing prevalence, co-occurrence, thresholds, and conditional effects, our study directly addresses a persistent gap in SDP research. It moves beyond isolated analyses to provide a holistic, data-aware understanding of how quality issues shape model performance in real-world settings.