🤖 AI Summary
This study addresses the challenge of distinguishing intrinsic software defects from extrinsic issues caused by external dependencies or environmental factors, a distinction that existing defect datasets do not make. The authors construct a manually annotated dataset of 377 GitHub issues across 103 NPM repositories, categorizing each issue as an intrinsic defect, an extrinsic problem, a non-defect, or unknown. The dataset also includes temporal and behavioral metadata covering maintainer engagement, code changes, and reopening behavior. Analysis shows that intrinsic defects are resolved faster (median 8.9 vs. 10.2 days), are closed more often (92% vs. 78%), and are more frequently accompanied by code modifications (57% vs. 28%). In contrast, extrinsic problems exhibit higher reopening rates (12% vs. 4%) and longer recurrence delays (median 157 vs. 87 days). The dataset provides a foundation for fine-grained study of intrinsic and extrinsic defects in the NPM ecosystem.
📝 Abstract
Understanding the causes of software defects is essential for reliable software maintenance and ecosystem stability. However, existing bug datasets do not distinguish issues originating within a project from those caused by external dependencies or environmental factors. In this paper, we present InEx-Bug, a manually annotated dataset of 377 GitHub issues from 103 NPM repositories that categorizes each issue as Intrinsic (internal defect), Extrinsic (dependency/environment issue), Not-a-Bug, or Unknown. Beyond labels, the dataset includes rich temporal and behavioral metadata such as maintainer participation, code changes, and reopening patterns. Analyses show that Intrinsic bugs resolve faster (median 8.9 vs 10.2 days), are closed more often (92% vs 78%), and require code changes more frequently (57% vs 28%) than Extrinsic bugs, whereas Extrinsic bugs exhibit higher reopen rates (12% vs 4%) and delayed recurrence (median 157 vs 87 days). The dataset provides a foundation for further study of Intrinsic and Extrinsic defects in the NPM ecosystem.
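The per-category statistics quoted above (median resolution time, closure rate, code-change rate, reopen rate) could be computed roughly as in the following sketch. The `Issue` record, its field names, and the toy rows are hypothetical illustrations, not the InEx-Bug schema or data; the choice of all issues in a category as the denominator for each rate is also an assumption.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class Issue:
    # Hypothetical record layout; the real dataset's field names may differ.
    label: str                      # "Intrinsic", "Extrinsic", "Not-a-Bug", or "Unknown"
    days_to_close: Optional[float]  # None if the issue is still open
    has_code_change: bool
    reopened: bool


def summarize(issues, label):
    """Return summary statistics for all issues carrying the given label."""
    group = [i for i in issues if i.label == label]
    closed = [i.days_to_close for i in group if i.days_to_close is not None]
    n = len(group)
    return {
        "n": n,
        "median_days_to_close": median(closed) if closed else None,
        # Assumption: every rate uses all issues in the category as denominator.
        "closure_rate": len(closed) / n if n else None,
        "code_change_rate": sum(i.has_code_change for i in group) / n if n else None,
        "reopen_rate": sum(i.reopened for i in group) / n if n else None,
    }


# Synthetic toy rows, for illustration only.
toy = [
    Issue("Intrinsic", 2.0, True, False),
    Issue("Intrinsic", 6.0, True, False),
    Issue("Intrinsic", None, False, False),
    Issue("Extrinsic", 10.0, False, True),
    Issue("Extrinsic", None, False, False),
]

for category in ("Intrinsic", "Extrinsic"):
    print(category, summarize(toy, category))
```

Comparing the two resulting dictionaries side by side mirrors the Intrinsic vs Extrinsic contrasts reported in the abstract; the paper's exact definitions (for example, whether reopen rates are taken over closed issues only) may differ.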