🤖 AI Summary
Open-source software development within academic institutions, such as universities and research laboratories, is highly fragmented and lacks visibility, which hinders the identification, attribution, and impact assessment of scholarly tools.
Method: We propose a scalable open-source ecosystem observability framework that integrates metadata extraction, conventional machine learning, and large language models to construct an end-to-end institutional attribution pipeline; automated data collection and classification are implemented via the GitHub REST API.
Contribution/Results: This is the first systematic effort to identify and analyze institution-affiliated repositories across a university system. Evaluated on the University of California system, our approach discovered over 52,000 institution-associated repositories and predicted institutional affiliation with high accuracy. The framework provides a reproducible methodology and empirical foundation for academic open-source governance, impact evaluation, and strategic resource allocation.
📝 Abstract
Open source software development, particularly within institutions such as universities and research laboratories, is often decentralized and difficult to track. Although these efforts produce highly impactful scientific tools, they often go unrecognized because of limited visibility and institutional awareness. This paper addresses the challenge of discovering, classifying, and analyzing open source software projects developed across distributed institutional systems. We present a framework for systematically identifying institutionally affiliated repositories, using the University of California (UC) system as a case study.
Using GitHub's REST API, we build a pipeline to discover relevant repositories and extract meaningful metadata. We then propose and evaluate multiple classification strategies, including both traditional machine learning models and large language models (LLMs), to distinguish affiliated projects from unrelated repositories and generate accurate insights into the academic open source landscape. Our results show that the framework is effective at scale, discovering over 52,000 repositories and predicting institutional affiliation with high accuracy.
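The discovery and metadata-extraction steps described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names (`build_search_url`, `extract_metadata`, `keyword_features`), the choice of metadata fields, and the keyword list are assumptions made for illustration. The sketch builds a query URL for GitHub's repository search endpoint, flattens one repository object from a search response into a small record, and derives simple keyword flags of the kind a traditional classifier might take as input. No network call is made, so authentication, pagination limits, and rate limiting are left out.

```python
from urllib.parse import urlencode

# GitHub REST API repository search endpoint.
SEARCH_ENDPOINT = "https://api.github.com/search/repositories"


def build_search_url(term: str, page: int = 1, per_page: int = 100) -> str:
    """Build a repository search URL for one page of results."""
    params = {"q": term, "per_page": per_page, "page": page}
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


def extract_metadata(repo: dict) -> dict:
    """Flatten one repository object into a small record.

    The selected fields are an illustrative assumption; the paper does not
    list which metadata its pipeline keeps.
    """
    owner = repo.get("owner") or {}
    return {
        "full_name": repo.get("full_name", ""),
        "owner": owner.get("login", ""),
        "description": repo.get("description") or "",  # may be null in the API
        "homepage": repo.get("homepage") or "",
        "topics": list(repo.get("topics") or []),
        "stars": repo.get("stargazers_count", 0),
    }


def keyword_features(meta: dict, keywords: list[str]) -> dict:
    """Flag which (hypothetical) affiliation keywords appear in the metadata."""
    text = " ".join(
        [meta["full_name"], meta["description"], meta["homepage"], *meta["topics"]]
    ).lower()
    return {kw: kw.lower() in text for kw in keywords}
```

In use, each item in the `items` array of a search response would be passed through `extract_metadata`, and the resulting records, or features derived from them, would then go to whichever classifier is being evaluated.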