🤖 AI Summary
Although Docker is widely assumed to ensure reproducibility of software environments, its practical efficacy remains insufficiently validated. This study presents a systematic investigation that combines a literature review with a large-scale empirical analysis of 5,298 Docker builds collected from GitHub workflows. By rebuilding Docker images, comparing them against their historical counterparts, and mining workflow patterns, we quantitatively assess the reproducibility of Docker builds and the effectiveness of recommended best practices. Our findings reveal that a significant proportion of Docker builds are not reproducible, and existing best practices offer limited improvements in practice. These results challenge the prevailing assumption that “containers guarantee reproducibility” and provide empirical evidence and actionable insights for enhancing reproducibility in computational research.
📝 Abstract
The reproducibility of software environments is a critical concern in modern software engineering, with ramifications ranging from the effectiveness of collaboration workflows to software supply chain security and scientific reproducibility. Containerization technologies like Docker address this problem by encapsulating software environments into shareable filesystem snapshots known as images. While Docker is frequently cited in the literature as a tool that enables reproducibility in theory, the extent of its guarantees and limitations in practice remains under-explored. In this work, we address this gap through two complementary approaches. First, we conduct a systematic literature review to examine how Docker is framed in scientific discourse on reproducibility and to identify documented best practices for writing Dockerfiles that enable reproducible image builds. Then, we perform a large-scale empirical study of 5,298 Docker builds collected from GitHub workflows. By rebuilding these images and comparing the results with their historical counterparts, we assess the actual reproducibility of Docker images and evaluate the effectiveness of the best practices identified in the literature.
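To make the rebuild-and-compare idea concrete, the following is a minimal sketch and not the authors' tooling. It assumes that the historical image and the rebuilt image have each already been unpacked into a local directory (for instance from the tar archive produced by `docker export`), and it compares file contents only, ignoring timestamps, ownership, and permissions. The function names `digest_tree`, `diff_trees`, and `is_reproducible` are hypothetical and chosen for illustration.

```python
import hashlib
from pathlib import Path


def digest_tree(root):
    """Map each regular file under root (by relative path) to the SHA-256 of its bytes."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        # Skip directories and symlinks; only file contents are compared here.
        if path.is_file() and not path.is_symlink():
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def diff_trees(historical_root, rebuilt_root):
    """Classify paths as removed, added, or changed between two unpacked images."""
    old = digest_tree(historical_root)
    new = digest_tree(rebuilt_root)
    return {
        "removed": sorted(old.keys() - new.keys()),
        "added": sorted(new.keys() - old.keys()),
        "changed": sorted(p for p in old.keys() & new.keys() if old[p] != new[p]),
    }


def is_reproducible(diff):
    """Assumption of this sketch: a rebuild counts as reproducible if no path differs."""
    return not any(diff.values())
```

Under this content-only definition, a rebuild that differs solely in file modification times would still count as reproducible; a stricter, bit-for-bit criterion would compare the image archives or their digests directly.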