🤖 AI Summary
This study presents the first large-scale empirical analysis of Git history rewriting and its threats to software supply chain integrity and reproducibility. History-altering operations (e.g., rebase, filter-branch) in public repositories carry risks that include push conflicts, broken provenance, and malicious code injection. To measure them, the authors analyze 111 million public repositories archived by Software Heritage. Leveraging static analysis and two in-depth case studies, they propose an evidence-driven taxonomy of Git history rewriting, organized by where alterations occur and what they change, and develop GitHistorian, an automated detection tool. Applied at scale, the methodology identifies 1.22 million repositories exhibiting history rewriting (8.7 million rewritten histories in total) and shows that rewrites recurrently serve to change licenses retroactively or to remove sensitive information committed by mistake. The work establishes a scalable methodology for supply chain security assessment and delivers a tool for detecting and describing history alterations.
📝 Abstract
Version Control Systems (VCS) like Git allow developers to locally rewrite recorded history, e.g., to reorder and suppress commits or specific data in them. These alterations have legitimate use cases, but become problematic when performed on public branches that have downstream users: they break push/pull workflows, challenge the integrity and reproducibility of repositories, and create opportunities for supply chain attackers to sneak nefarious changes into them. We conduct the first large-scale investigation of Git history alterations in public code repositories. We analyze 111 million repositories archived by Software Heritage, which preserves VCS histories even across alterations. We find history alterations in 1.22 million repositories, for a total of 8.7 million rewritten histories. We categorize changes by where they happen (which repositories, which branches) and what is changed in them (files or commit metadata). Conducting two targeted case studies, we show that altered histories recurrently change licenses retroactively, or are used to remove "secrets" (e.g., private keys) committed by mistake. As these behaviors correspond to bad practices (in terms of project governance and security management, respectively) that software recipients might want to avoid, we introduce GitHistorian, an automated tool that developers can use to spot and describe history alterations in public Git repositories.
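To make the notion of a "history alteration" concrete, here is a minimal sketch of one way such a rewrite could be recognized when two archived states of the same branch are available. It is an illustration under stated assumptions, not the method used by the authors or by GitHistorian: the commit graph is assumed to be given as a plain dictionary mapping each commit ID to its parent IDs, and the function names (`is_ancestor`, `history_was_rewritten`) are hypothetical. The idea is that a normal update only appends commits, so the previously observed branch head stays reachable from the new head. If it is no longer reachable, the branch history was replaced rather than extended.

```python
from collections import deque


def is_ancestor(parents, ancestor, descendant):
    """Return True if `ancestor` is reachable from `descendant` via parent links.

    `parents` is an assumed representation: a dict mapping a commit ID
    to a list of its parent commit IDs. A commit counts as its own ancestor.
    """
    seen = {descendant}
    queue = deque([descendant])
    while queue:
        commit = queue.popleft()
        if commit == ancestor:
            return True
        for parent in parents.get(commit, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return False


def history_was_rewritten(parents, old_head, new_head):
    """Flag a rewrite when the old branch head is not an ancestor of the new one."""
    return not is_ancestor(parents, old_head, new_head)


# Hypothetical commit graph: a <- b <- c is the original branch;
# b2 and c2 are rewritten versions of b and c that share only a.
parents = {
    "a": [],
    "b": ["a"],
    "c": ["b"],
    "b2": ["a"],
    "c2": ["b2"],
}

fast_forward = history_was_rewritten(parents, "b", "c")   # branch only grew
rewritten = history_was_rewritten(parents, "c", "c2")     # old head was dropped
```

In this toy graph, moving the branch from `b` to `c` is an ordinary update, while moving it from `c` to `c2` discards `b` and `c`, which is what a rebase or filter-style rewrite followed by a forced push would leave behind. An archive that keeps both states, as the abstract says Software Heritage does, is what makes this kind of comparison possible.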