Predicting post-release defects with knowledge units (KUs) of programming languages: an empirical study

📅 2024-12-03
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Traditional defect prediction metrics do not reveal system traits tied to the building blocks of a programming language, which limits the insight they offer. Method: This paper reports an empirical study of Java Knowledge Units (KUs), cohesive sets of key capabilities offered by one or more language building blocks, whose incidences are extracted by analyzing source code. Two models are built: KUCLS, which uses KU-based features, and KUCLS_CC, which combines KUs with traditional code metrics. Contribution/Results: Across 28 releases of 8 Java systems, KUs prove different from and complementary to traditional metrics. KUCLS achieves a median AUC of 0.82, with a normalized AUC improvement of 5.1% to 28.9% over CC_PROD, a baseline built with product metrics. KUCLS_CC yields further AUC gains of 4.9% to 33.3% over CC and 5.6% to 59.9% over KUCLS. A cost-effective model also significantly outperforms CC. The study positions KUs as a new lens for feature engineering in defect prediction.

📝 Abstract
Traditional code metrics (product and process metrics) have been widely used in defect prediction. However, these metrics have an inherent limitation: they do not reveal system traits that are tied to certain building blocks of a given programming language. Taking these building blocks into account can lead to further insights about a software system and improve defect prediction. To fill this gap, this paper reports an empirical study on the usage of knowledge units (KUs) of the Java programming language. A KU is a cohesive set of key capabilities that are offered by one or more building blocks of a given programming language. This study aims to understand whether we can obtain richer results in defect prediction when using KUs in combination with traditional code metrics. Using a defect dataset covering 28 releases of 8 Java systems, we analyze source code to extract both traditional code metrics and KU incidences. We find empirical evidence that KUs are different from and complementary to traditional metrics, thus offering a new lens through which software systems can be analyzed. We build a defect prediction model called KUCLS, which leverages the KU-based features. KUCLS achieves a median AUC of 0.82 and significantly outperforms CC_PROD (a model built with product metrics). The normalized AUC improvement of KUCLS over CC_PROD ranges from 5.1% to 28.9% across the studied releases. Combining KUs with traditional metrics in KUCLS_CC further improves performance, with AUC gains of 4.9% to 33.3% over CC and 5.6% to 59.9% over KUCLS. Finally, we develop a cost-effective model that significantly outperforms CC. These encouraging results can be helpful to researchers who wish to further study feature engineering and model building for defect prediction.
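The abstract reports a "normalized AUC improvement" without giving its formula. One plausible reading, assumed here and not confirmed by this page, is the raw AUC gain divided by the headroom the baseline leaves below a perfect AUC of 1.0. The function name and the AUC values below are made up for illustration and are not results from the paper.

```python
def normalized_auc_improvement(auc_new: float, auc_base: float) -> float:
    """Assumed definition: gain relative to the baseline's remaining headroom.

    (auc_new - auc_base) / (1 - auc_base)
    This formula is an assumption; the source text does not state it.
    """
    if not 0.0 <= auc_base < 1.0:
        raise ValueError("baseline AUC must be in [0, 1)")
    if not 0.0 <= auc_new <= 1.0:
        raise ValueError("new AUC must be in [0, 1]")
    return (auc_new - auc_base) / (1.0 - auc_base)


if __name__ == "__main__":
    # Made-up AUC values, chosen only to show the arithmetic.
    gain = normalized_auc_improvement(auc_new=0.76, auc_base=0.70)
    print(f"{gain:.1%}")  # prints 20.0%
```

Under this reading, the same raw gain of 0.06 counts for more against a strong baseline than against a weak one, because less headroom remains.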

Problem

Research questions and friction points this paper is trying to address.

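Traditional product and process metrics do not reveal system traits tied to specific building blocks of a programming language.
Asks whether Knowledge Units (KUs) of Java carry information that is different from and complementary to traditional metrics.
Asks whether combining KUs with traditional metrics improves the prediction of post-release defects.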

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Knowledge Units (KUs) as features for defect prediction.
Combines KUs with traditional metrics for higher AUC.
Develops a cost-effective model using only 10 features.
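
The first two points can be pictured with a small sketch: count how often constructs belonging to each KU occur in a Java file, then append those counts to the file's traditional metrics to form one feature row. Everything named here is hypothetical. The KU names, the regular expressions, and the metric names are illustrative stand-ins, and regex matching is a crude substitute for the source-code analysis the study relies on.

```python
import re

# Hypothetical KU definitions: names and patterns are illustrative only,
# not the KUs used in the study.
HYPOTHETICAL_KUS = {
    "exception_ku": [r"\btry\b", r"\bcatch\b", r"\bthrow\b"],
    "concurrency_ku": [r"\bsynchronized\b", r"\bThread\b"],
    "generics_collections_ku": [r"\bList<", r"\bMap<"],
}


def ku_incidences(java_source: str) -> dict:
    """Count regex matches per hypothetical KU in one Java source file."""
    return {
        ku: sum(len(re.findall(pattern, java_source)) for pattern in patterns)
        for ku, patterns in HYPOTHETICAL_KUS.items()
    }


def feature_row(java_source: str, traditional_metrics: dict) -> dict:
    """Combine traditional metrics with KU incidences into one feature row."""
    row = dict(traditional_metrics)  # copy so the caller's dict is untouched
    row.update(ku_incidences(java_source))
    return row


SAMPLE = """
public class Worker {
    private final List<String> jobs = new ArrayList<>();
    public synchronized void run() {
        try {
            jobs.clear();
        } catch (RuntimeException e) {
            throw e;
        }
    }
}
"""

if __name__ == "__main__":
    # The metric values are made up for the example.
    print(feature_row(SAMPLE, {"loc": 11, "cyclomatic_complexity": 2}))
```

A row like this, one per file and release, is the kind of input a classifier would be trained on. Dropping the KU columns or the traditional-metric columns gives the single-source variants for comparison.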