Canonicalization for Unreproducible Builds in Java

📅 2025-04-30
🤖 AI Summary
Unreproducible Java builds undermine supply-chain security because users cannot verify that a published artifact corresponds to its declared source. The paper analyzes a large dataset from Reproducible Central and classifies the root causes of unreproducibility into six categories, which the authors present as the first large-scale taxonomy of build unreproducibility causes in Java. It studies two levels of mitigation: artifact canonicalization with OSS-Rebuild and bytecode canonicalization with jNorm. Building on these, the authors present Chains-Rebuild, a canonicalization tool for mitigating unreproducible Java builds, and release a publicly available dataset of unreproducible builds. On 12,283 unreproducible artifacts, Chains-Rebuild raises reproducibility success from 9.48% to 26.89%.

📝 Abstract
The increasing complexity of software supply chains and the rise of supply chain attacks have elevated concerns around software integrity. Users and stakeholders face significant challenges in validating that a given software artifact corresponds to its declared source. Reproducible Builds address this challenge by ensuring that independently performed builds from identical source code produce identical binaries. However, achieving reproducibility at scale remains difficult, especially in Java, due to a range of non-deterministic factors and caveats in the build process. In this work, we focus on reproducibility in Java-based software, archetypal of enterprise applications. We introduce a conceptual framework for reproducible builds, we analyze a large dataset from Reproducible Central, and we develop a novel taxonomy of six root causes of unreproducibility. We study actionable mitigations: artifact and bytecode canonicalization using OSS-Rebuild and jNorm respectively. Finally, we present Chains-Rebuild, a tool that raises reproducibility success from 9.48% to 26.89% on 12,283 unreproducible artifacts. To sum up, our contributions are the first large-scale taxonomy of build unreproducibility causes in Java, a publicly available dataset of unreproducible builds, and Chains-Rebuild, a canonicalization tool for mitigating unreproducible builds in Java.
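
To make the idea of artifact canonicalization concrete, the following is a minimal sketch in Java. It is not the implementation used by OSS-Rebuild, jNorm, or Chains-Rebuild; the class `ArchiveCanonicalizer`, its methods, and the fixed timestamp are hypothetical names and values chosen for illustration. It assumes that two builds differ only in archive entry order and entry timestamps, which are two common sources of byte-level differences in JAR files, and shows that rewriting both archives into one canonical form makes their SHA-256 digests comparable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.time.LocalDateTime;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ArchiveCanonicalizer {
    // Arbitrary fixed timestamp (an assumption of this sketch, not a value from the paper).
    static final LocalDateTime FIXED_TIME = LocalDateTime.of(2000, 1, 1, 0, 0);

    // Rewrites a ZIP/JAR archive with entries sorted by name and a fixed timestamp.
    static byte[] canonicalize(byte[] archive) throws IOException {
        Map<String, byte[]> entries = new TreeMap<>(); // TreeMap sorts entries by name
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(archive))) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                entries.put(entry.getName(), in.readAllBytes());
            }
        }
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(buffer)) {
            for (Map.Entry<String, byte[]> e : entries.entrySet()) {
                ZipEntry canonical = new ZipEntry(e.getKey());
                canonical.setTimeLocal(FIXED_TIME); // drop the build-time timestamp
                out.putNextEntry(canonical);
                out.write(e.getValue());
                out.closeEntry();
            }
        }
        return buffer.toByteArray();
    }

    // Hex-encoded SHA-256 digest, the usual way to compare two artifacts.
    static String sha256Hex(byte[] data) throws NoSuchAlgorithmException {
        StringBuilder hex = new StringBuilder();
        for (byte b : MessageDigest.getInstance("SHA-256").digest(data)) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Test helper: builds an archive with the given entry order and timestamp year.
    static byte[] buildArchive(List<String> names, int year) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(buffer)) {
            for (String name : names) {
                ZipEntry entry = new ZipEntry(name);
                entry.setTimeLocal(LocalDateTime.of(year, 1, 1, 0, 0));
                out.putNextEntry(entry);
                out.write(("content of " + name).getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Same content, different entry order and timestamps.
        byte[] first = buildArchive(List.of("A.class", "B.class"), 2020);
        byte[] second = buildArchive(List.of("B.class", "A.class"), 2024);
        System.out.println("raw digests equal: "
                + sha256Hex(first).equals(sha256Hex(second)));
        System.out.println("canonical digests equal: "
                + sha256Hex(canonicalize(first)).equals(sha256Hex(canonicalize(second))));
    }
}
```

The sketch handles only ordering and timestamps, and it relies on both archives being rewritten by the same JDK so that compression output matches. Differences inside the entries themselves, such as bytecode that varies between compilers, are outside its scope; that is the level the abstract assigns to bytecode canonicalization.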
Problem

Research questions and friction points this paper is trying to address.

Addressing unreproducible builds in Java software
Identifying root causes of build non-determinism
Improving reproducibility via canonicalization tools
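
As an illustration of the first step toward identifying a root cause, the sketch below lists which archive entries differ between a reference artifact and a rebuilt one. It is a hypothetical helper, not part of the paper's tooling; the class `ArtifactDiff`, its method names, and the sample entry names are assumptions made for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ArtifactDiff {
    // Maps each entry name to a digest of its content (metadata such as timestamps is ignored).
    static Map<String, String> entryDigests(byte[] archive)
            throws IOException, NoSuchAlgorithmException {
        Map<String, String> digests = new TreeMap<>();
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(archive))) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                byte[] hash = sha256.digest(in.readAllBytes()); // digest() resets the instance
                digests.put(entry.getName(), Base64.getEncoder().encodeToString(hash));
            }
        }
        return digests;
    }

    // Names of entries whose content differs, or that exist in only one of the two archives.
    static SortedSet<String> differingEntries(byte[] reference, byte[] rebuilt)
            throws IOException, NoSuchAlgorithmException {
        Map<String, String> referenceDigests = entryDigests(reference);
        Map<String, String> rebuiltDigests = entryDigests(rebuilt);
        SortedSet<String> names = new TreeSet<>(referenceDigests.keySet());
        names.addAll(rebuiltDigests.keySet());
        names.removeIf(n -> Objects.equals(referenceDigests.get(n), rebuiltDigests.get(n)));
        return names;
    }

    // Test helper: packs name -> text pairs into an in-memory archive.
    static byte[] zip(Map<String, String> files) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(buffer)) {
            for (Map.Entry<String, String> file : files.entrySet()) {
                out.putNextEntry(new ZipEntry(file.getKey()));
                out.write(file.getValue().getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> reference = new LinkedHashMap<>();
        reference.put("Foo.class", "bytecode");
        reference.put("META-INF/MANIFEST.MF", "Built-By: alice");
        Map<String, String> rebuilt = new LinkedHashMap<>();
        rebuilt.put("Foo.class", "bytecode");
        rebuilt.put("META-INF/MANIFEST.MF", "Built-By: bob");
        System.out.println(differingEntries(zip(reference), zip(rebuilt)));
    }
}
```

Knowing which entries differ narrows the investigation, for example to a manifest field rather than compiled classes, but mapping a difference to one of the paper's six root-cause categories still requires inspecting the entry contents.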
Innovation

Methods, ideas, or system contributions that make the work stand out.

Conceptual framework for reproducible Java builds
Artifact canonicalization with OSS-Rebuild and bytecode canonicalization with jNorm
Chains-Rebuild raises reproducibility success from 9.48% to 26.89% on 12,283 unreproducible artifacts