Altered Histories in Version Control System Repositories: Evidence from the Trenches

📅 2025-09-11
🤖 AI Summary
This study presents the first large-scale empirical analysis of Git history rewriting in public repositories and of its threats to software supply chain integrity and reproducibility. History-altering operations (e.g., rebase, filter-branch) performed on public branches can break push/pull workflows, undermine repository integrity, and give attackers a way to inject malicious code. The authors analyze 111 million repositories archived by Software Heritage, which preserves VCS histories even across alterations, and find alterations in 1.22 million of them, for a total of 8.7 million rewritten histories. They categorize alterations by where they occur (which repositories, which branches) and by what they change (files or commit metadata). Two targeted case studies show that rewritten histories recurrently change licenses retroactively or remove secrets, such as private keys, committed by mistake; the authors regard both as bad practices, in project governance and security management respectively. The work also introduces GitHistorian, an automated tool for spotting and describing history alterations in public Git repositories.

📝 Abstract
Version Control Systems (VCS) like Git allow developers to locally rewrite recorded history, e.g., to reorder and suppress commits or specific data in them. These alterations have legitimate use cases, but become problematic when performed on public branches that have downstream users: they break push/pull workflows, challenge the integrity and reproducibility of repositories, and create opportunities for supply chain attackers to sneak nefarious changes into them. We conduct the first large-scale investigation of Git history alterations in public code repositories. We analyze 111 million repositories archived by Software Heritage, which preserves VCS histories even across alterations. We find history alterations in 1.22 million repositories, for a total of 8.7 million rewritten histories. We categorize changes by where they happen (which repositories, which branches) and what is changed in them (files or commit metadata). Conducting two targeted case studies, we show that altered histories recurrently change licenses retroactively, or are used to remove "secrets" (e.g., private keys) committed by mistake. As these behaviors correspond to bad practices (in terms of project governance and security management, respectively) that software recipients might want to avoid, we introduce GitHistorian, an automated tool that developers can use to spot and describe history alterations in public Git repositories.
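To make the notion of a history alteration concrete, here is a minimal Python sketch. It is not the paper's method and not GitHistorian's implementation; it only illustrates one common criterion, under the assumption that an archive retains every commit ever observed: a branch update alters history when the previously observed head is no longer reachable from the new head (a non-fast-forward update). The toy graph, the commit ids, and the function names are all hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical toy commit graph: commit id -> list of parent commit ids.
# It holds commits from both observations of the branch, as an archive would.
Graph = Dict[str, List[str]]


def ancestors(graph: Graph, head: str) -> Set[str]:
    """Return `head` plus every commit reachable from it via parent links."""
    seen: Set[str] = set()
    stack = [head]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(graph.get(commit, []))
    return seen


def is_history_altered(graph: Graph, old_head: str, new_head: str) -> bool:
    """True when the old head is not an ancestor of (or equal to) the new head."""
    return old_head not in ancestors(graph, new_head)


graph: Graph = {
    "a": [],
    "b": ["a"],
    "c": ["b"],    # branch head at the first observation
    "d": ["c"],    # ordinary new commit on top of c
    "b2": ["a"],   # rewritten replacement for b
    "c2": ["b2"],  # branch head after the rewrite
}

fast_forward = is_history_altered(graph, "c", "d")
rewritten = is_history_altered(graph, "c", "c2")
print(fast_forward, rewritten)
```

In this toy example, moving the branch from `c` to `d` keeps `c` reachable, while moving it from `c` to `c2` drops `b` and `c` from the branch history, which is the kind of change downstream users cannot pull cleanly.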
Problem

Research questions and friction points this paper is trying to address.

Investigating Git history alterations in public repositories
Analyzing impacts on repository integrity and reproducibility
Identifying security risks from rewritten commit histories
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzed 111 million repositories for alterations
Introduced GitHistorian automated detection tool
Categorized alterations by where they occur (repositories, branches) and what they change (files or commit metadata)
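The files-versus-metadata distinction can be illustrated with a short Python sketch. This is an assumption-laden illustration, not the paper's categorization procedure: it presumes that an original commit and its rewritten counterpart have already been paired up, and it models a commit with only three hypothetical fields (`tree`, `author`, `message`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """Hypothetical, simplified commit record."""
    tree: str      # identifier of the file tree snapshot
    author: str    # author identity
    message: str   # commit message


def classify(old: Commit, new: Commit) -> str:
    """Label what differs between a commit and its rewritten counterpart."""
    files_changed = old.tree != new.tree
    metadata_changed = (old.author, old.message) != (new.author, new.message)
    if files_changed and metadata_changed:
        return "files+metadata"
    if files_changed:
        return "files"
    if metadata_changed:
        return "metadata"
    # Same tree and same modeled metadata: the difference lies elsewhere
    # (for example in parents or dates, which this sketch does not model).
    return "other"


original = Commit(tree="t1", author="alice", message="Add config")
scrubbed = Commit(tree="t2", author="alice", message="Add config")
reworded = Commit(tree="t1", author="alice", message="Add configuration")

scrubbed_label = classify(original, scrubbed)
reworded_label = classify(original, reworded)
print(scrubbed_label, reworded_label)
```

A rewrite that removes a committed secret would change the tree and fall under "files", whereas rewording a message or changing an author would fall under "metadata".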
Solal Rapaport
LTCI, Telecom Paris, Institut Polytechnique de Paris, Palaiseau, France
Laurent Pautet
LTCI, Telecom Paris, Institut Polytechnique de Paris, Palaiseau, France
Samuel Tardieu
Télécom Paris
Stefano Zacchiroli
LTCI, Télécom Paris, Polytechnique Institute of Paris, France
software engineering, open source software, digital commons, computer security