🤖 AI Summary
This work addresses the frequent failures of continuous integration (CI) builds in embedded open-source software, which often stem from cross-compilation complexity, board-specific configurations, and toolchain constraints. The problem is compounded by build logs that are heterogeneous, short-lived, and difficult to reuse. To tackle this challenge, the authors propose PhantomRun, a framework that enables standardized reproduction of historical failed builds through a build log abstraction layer, metadata standardization, containerized replay environments, and parsing of heterogeneous logs. PhantomRun is the first system to support large-scale, controllable replay of failed embedded CI builds, offering a unified, machine-readable interface for build artifacts and metadata. Evaluated on 4,628 failed runs, PhantomRun reconstructed 91.8% of the builds and preserved the original execution outcome in 98% of the evaluated cases, indicating high reproduction fidelity.
📝 Abstract
In embedded systems, where hardware and software are developed together, continuous integration (CI) builds frequently fail because of complex cross-compilation, board configurations, and toolchain constraints. Although CI build logs contain valuable diagnostic information, they are short-lived and difficult to reuse due to heterogeneous runners, toolchains, and log formats. To address these challenges, we present PhantomRun, a unified abstraction layer and publicly reusable dataset that standardizes the retrieval, storage, and reproduction of CI build logs and metadata. Across 4,628 failing CI runs, we reconstructed 91.8% of builds and preserved execution outcomes in 98% of evaluated cases.
PhantomRun provides two core capabilities: retrieving the build log of any commit and faithfully re-executing the corresponding build in a controlled environment. By exposing all build artifacts and metadata in a uniform, machine-readable format, PhantomRun enables reproducible and longitudinal studies of CI failures. An empirical evaluation shows that reproduced builds closely match their originals, typically differing only in timestamps or minor nondeterministic reordering, demonstrating the feasibility of large-scale historical CI reconstruction.
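To make the comparison between an original and a reproduced log concrete, the following is a minimal sketch of how two logs could be checked for equivalence up to timestamps and line reordering. It is not PhantomRun's implementation: the function names (`normalize`, `compare_logs`), the `<TS>` placeholder, and the timestamp patterns are all assumptions chosen for illustration, and real CI logs may need additional normalization.

```python
import re
from collections import Counter

# Assumed timestamp shapes (ISO-8601 date-times and bare HH:MM:SS clocks).
# Real CI logs may use other formats; extend this pattern as needed.
TIMESTAMP = re.compile(
    r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:?\d{2})?"
    r"|\b\d{2}:\d{2}:\d{2}(?:\.\d+)?\b"
)


def normalize(log: str) -> list[str]:
    """Mask timestamps with a placeholder and drop blank lines."""
    lines = []
    for line in log.splitlines():
        masked = TIMESTAMP.sub("<TS>", line).strip()
        if masked:
            lines.append(masked)
    return lines


def compare_logs(original: str, replayed: str) -> str:
    """Classify a replayed log relative to the original (hypothetical helper).

    Returns "identical" if the logs match once timestamps are masked,
    "reordered" if they contain the same lines in a different order,
    and "different" otherwise.
    """
    a, b = normalize(original), normalize(replayed)
    if a == b:
        return "identical"
    if Counter(a) == Counter(b):
        return "reordered"
    return "different"
```

Comparing line multisets with `Counter` treats any permutation as a benign reordering, which is deliberately coarse; a stricter check might only tolerate reordering within a parallel build step.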