AI Summary
This work addresses two limitations of traditional file-level defect prediction models: they often overlook the impact of commit size (the number of files modified in a single commit) on software quality, and they struggle to capture higher-order change semantics. To overcome these issues, the authors propose a commit-size-aware defect prediction approach that reformulates scalar process metrics as high-dimensional vectors incorporating commit size, and constructs a hyper-co-change graph whose hyperedges naturally encode size information. File importance is then quantified via centrality measures computed on this hypergraph. An empirical evaluation on nine long-lived Apache projects shows that the proposed method outperforms the baselines, with statistically significant improvements in predictive performance, discriminative power, and calibration.
Abstract
File-level defect prediction models traditionally rely on product and process metrics. While process metrics effectively complement product metrics, they often overlook commit size, the number of files changed per commit, despite its strong association with software quality. Network centrality measures on dependency graphs have also proven to be valuable product-level indicators. Motivated by these observations, we first redefine process metrics as commit-size-aware process metric vectors, transforming conventional scalar measures into 100-dimensional profiles that capture the distribution of changes across commit-size strata. We then model change history as a hyper-co-change graph, where hyperedges naturally encode commit-size semantics. Vector centralities computed on these hypergraphs quantify size-aware node importance for source files. Experiments on nine long-lived Apache projects using five popular classifiers show that replacing scalar process metrics with the proposed commit-size-aware vectors, alongside product metrics, consistently improves predictive performance. These findings show that commit-size-aware process metrics and hypergraph-based vector centralities capture higher-order change semantics, leading to more discriminative, better calibrated, and statistically superior defect prediction models.
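To make the idea of a commit-size-aware vector concrete, here is a minimal Python sketch. It is not the authors' implementation: the function name `commit_size_profile`, the one-stratum-per-commit-size binning, and the rule that commits larger than `dims` fall into the last stratum are all assumptions made for illustration, since the abstract does not specify how the strata are defined. Each commit is treated as a hyperedge (a set of files), and a file's scalar change count is spread across strata according to the size of the commits that touched it.

```python
from collections import defaultdict


def commit_size_profile(commits, dims=100):
    """Map each file to a vector of change counts per commit-size stratum.

    Assumption (not from the paper): stratum k-1 holds commits that touch
    exactly k files, and commits with more than `dims` files are clipped
    into the last stratum.
    """
    profiles = defaultdict(lambda: [0] * dims)
    for files in commits:
        files = set(files)          # one hyperedge = one commit
        if not files:
            continue                # skip empty commits
        stratum = min(len(files), dims) - 1
        for f in files:
            profiles[f][stratum] += 1
    return dict(profiles)


# Toy change history: each set lists the files changed in one commit.
commits = [
    {"A.java"},
    {"A.java", "B.java"},
    {"A.java", "B.java", "C.java"},
    {"B.java", "C.java"},
]

# Three strata are enough for this toy history; the paper uses 100.
profiles = commit_size_profile(commits, dims=3)
for name in sorted(profiles):
    print(name, profiles[name], "scalar count:", sum(profiles[name]))
```

Summing a profile recovers the conventional scalar metric (the number of commits that touched the file), so in this sketch the vector form strictly adds information: two files with the same change count can differ in whether they were changed mostly in small or in large commits.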