🤖 AI Summary
Existing Just-In-Time Defect Prediction (JIT-DP) studies largely overlook the confounding effect of code refactoring on defect labeling, leading to biased training data and distorted model evaluation. This work first systematically characterizes how refactoring and its propagation tangle with defect-inducing and defect-fixing changes within the same commits and statements. It proposes Code chAnge Tactics (CAT), an analysis that categorizes code refactoring and its propagation, enabling principled defect label correction and improving labeling accuracy on the JIT-Defects4J dataset by 13.7%. Experiments show that ignoring refactoring degrades model performance, especially for semantic-based models, while integrating refactoring information into six state-of-the-art JIT-DP approaches improves recall by up to 43.2% and F1-score by up to 32.5%. The results argue for refactoring-aware defect labels as a foundation for training and evaluating robust JIT-DP models.
📝 Abstract
Just-in-time defect prediction (JIT-DP) aims to predict, at an early stage, the likelihood that a code change introduces a software defect. Although code change metrics and semantic features have enhanced prediction accuracy, prior research has largely ignored code refactoring in both methodology and evaluation, despite its prevalence. Refactoring and its propagation are often tangled with bug-fixing and bug-inducing changes within the same commit, and even the same statement. Neglecting refactoring can therefore bias both the learning and the evaluation of JIT-DP models. To address this gap, we investigate the impact of refactoring and its propagation on six state-of-the-art JIT-DP approaches. We propose Code chAnge Tactics (CAT) analysis to categorize code refactoring and its propagation, which improves labeling accuracy in the JIT-Defects4J dataset by 13.7%. Our experiments reveal that failing to account for refactoring information in the dataset can diminish model performance, particularly for semantic-based models, by 18.6% and 37.3% in F1-score. Additionally, we propose integrating refactoring information to enhance the six baseline approaches, yielding overall improvements in recall and F1-score of up to 43.2% and 32.5%, respectively. Our research underscores the importance of incorporating refactoring information in both the methodology and the evaluation of JIT-DP. Furthermore, CAT has broad applicability for analyzing refactoring and its propagation in software maintenance.
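To make the label-correction idea concrete, here is a minimal toy sketch (not the authors' CAT implementation) of refactoring-aware filtering in an SZZ-style labeling pipeline. It assumes an upstream refactoring detector (e.g. RefactoringMiner-style output) has already flagged which lines touched by a bug-fixing commit are pure refactorings; the function name `filter_refactoring_lines` and the line-number inputs are illustrative only:

```python
def filter_refactoring_lines(fix_commit_lines, refactoring_lines):
    """Return the lines changed by a bug-fixing commit that remain
    candidate fix lines after discarding lines attributed to
    refactoring. Only the surviving lines would then be traced back
    (e.g. via blame) to mark defect-inducing commits."""
    refactored = set(refactoring_lines)
    return [line for line in fix_commit_lines if line not in refactored]

# Lines touched by a hypothetical bug-fixing commit:
fix_commit_lines = [10, 11, 12, 30, 31]
# Lines a detector flags as refactoring-only (e.g. a rename):
refactoring_lines = [30, 31]

print(filter_refactoring_lines(fix_commit_lines, refactoring_lines))  # [10, 11, 12]
```

Excluding the refactoring-only lines prevents purely structural edits from being traced back to innocent commits, which is the kind of labeling noise the paper quantifies.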