🤖 AI Summary
Although top-tier conferences such as ICSE now commonly require authors to submit replication packages, the actual executability and reproducibility of these packages remain largely unassessed. This study presents a large-scale empirical investigation of 100 replication packages from ICSE papers published between 2015 and 2024, involving approximately 650 person-hours of manual execution, debugging, and root-cause analysis. The findings reveal that only 40% of the packages were executable at all; of those, just 32.5% ran without any modification, while 82.5% required moderate to high effort to execute successfully. Among the executable packages, merely 35% reproduced the original results. This work exposes a significant gap between artifact availability, executability, and reproducibility in software engineering research and proposes three actionable guidelines to improve the reliability and utility of replication packages.
📝 Abstract
Replication packages are crucial for enabling transparency, validation, and reuse in software engineering (SE) research. While artifact sharing is now a standard practice and even expected at premier SE venues such as ICSE, the practical usability of these replication packages remains underexplored. In particular, there is a marked lack of studies that comprehensively examine the executability and reproducibility of replication packages in SE research. In this paper, we aim to fill this gap by evaluating 100 replication packages published as part of ICSE proceedings over the past decade (2015–2024). We assess the (1) executability of the replication packages, (2) efforts and modifications required to execute them, (3) challenges that prevent executability, and (4) reproducibility of the original findings. We spent approximately 650 person-hours in total executing the artifacts and reproducing the study findings. Our findings reveal that only 40% of the 100 evaluated artifacts were executable, of which 32.5% (13 out of 40) ran without any modification. Regarding effort levels, 17.5% (7 out of 40) required low effort, while 82.5% (33 out of 40) required moderate to high effort to execute successfully. We identified five common types of modifications and 13 challenges leading to execution failure, spanning environmental, documentation, and structural issues. Among the executable artifacts, only 35% (14 out of 40) reproduced the original results. These findings highlight a notable gap between artifact availability, executability, and reproducibility. Our study proposes three actionable guidelines to improve the preparation, documentation, and review of research artifacts, thereby strengthening the rigor and sustainability of open science practices in SE research.