🤖 AI Summary
This study addresses the lack of systematic understanding of what GitHub repositories contain and how that content evolves over time. It presents an initial empirical analysis of 10,000 real-world GitHub repositories, examining their files, directories, and file extensions and tracing how these changed over ten years. The findings show that README.md, .gitignore, and LICENSE have consolidated as standard artifacts; GitHub Actions has become the dominant CI/CD platform; configuration formats such as TOML, YAML, and JSON have grown while XML has declined; Dockerfiles have become more common; and content related to LLMs and generative AI (e.g., AGENTS.md) is emerging. The work offers quantitative evidence of technological shifts and standardization in the open-source ecosystem.
📝 Abstract
GitHub is the largest code hosting platform, with millions of repositories spanning multiple technologies. Despite this, little is known about the actual contents of GitHub repositories in the wild. This paper presents an initial empirical analysis to better understand the contents of real-world GitHub repositories. We analyze the files, directories, and extensions present in 10,000 GitHub repositories, as well as their evolution over ten years. Our results show major changes in GitHub over the last decade: (1) the consolidation of README.md, .gitignore, and LICENSE as standard artifacts; (2) the rise of GitHub Actions as the dominant CI/CD platform; (3) the growth of configuration formats such as TOML, YAML, and JSON, alongside a decline in XML; (4) new trends, such as the growing use of Dockerfiles; and (5) emerging content related to LLMs and generative AI (e.g., AGENTS.md). Based on our findings, we discuss three implications: open source is not only evolving organically but is also increasingly guided by GitHub's standards; technologies rise and fall over time; and the results can support mining software repositories studies.
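To make the kind of measurement described in the abstract concrete, the following is a minimal sketch of how file names, directory names, and extensions might be counted in a local checkout, and how the share of repositories containing a given artifact could be computed. It is an illustration only, not the authors' pipeline: the function names `summarize_repo` and `prevalence`, the decision to skip `.git`, and the lowercasing of extensions are all assumptions made for this example.

```python
import os
from collections import Counter
from pathlib import Path


def summarize_repo(root):
    """Count file names, directory names, and extensions in one checkout.

    Hypothetical helper for illustration; not taken from the paper.
    """
    files, dirs, exts = Counter(), Counter(), Counter()
    for _current, dirnames, filenames in os.walk(root):
        # Assumption: Git's internal metadata is not repository content.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        dirs.update(dirnames)
        for name in filenames:
            files[name] += 1
            # Dotfiles such as .gitignore and names such as Dockerfile
            # have no suffix, so they are counted by name only.
            suffix = Path(name).suffix.lower()
            if suffix:
                exts[suffix] += 1
    return files, dirs, exts


def prevalence(summaries, name):
    """Share of repositories containing at least one file called `name`."""
    if not summaries:
        return 0.0
    hits = sum(1 for files, _dirs, _exts in summaries if files[name] > 0)
    return hits / len(summaries)
```

Running `summarize_repo` on snapshots of the same repositories at different dates and comparing `prevalence(..., "README.md")` or `prevalence(..., "Dockerfile")` across those dates would give a trend of the sort the abstract reports, under the assumptions stated above.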