Recipe for Discovery: A Framework for Systematic Open Source Project Identification

📅 2025-06-23
🤖 AI Summary
Open-source software development within academic institutions, such as universities and research laboratories, is highly fragmented and lacks visibility, which hinders the identification, attribution, and impact assessment of scholarly tools. Method: We propose a scalable framework that combines metadata extraction, traditional machine learning, and large language models into an end-to-end institutional attribution pipeline; data collection is automated via the GitHub REST API. Contribution/Results: This is a systematic effort to identify and analyze institution-affiliated repositories across a university system. Using the University of California system as a case study, the pipeline discovered over 52,000 repositories and predicted institutional affiliation with high accuracy. The framework provides a reproducible methodology and empirical foundation for academic open-source governance, impact evaluation, and strategic resource allocation.

📝 Abstract
Open source software development, particularly within institutions such as universities and research laboratories, is often decentralized and difficult to track. Despite producing highly impactful scientific tools, these efforts often go unrecognized due to a lack of visibility and institutional awareness. This paper addresses the challenge of discovering, classifying, and analyzing open source software projects developed across distributed institutional systems. We present a framework for systematically identifying institutionally affiliated repositories, using the University of California (UC) system as a case study. Using GitHub's REST API, we build a pipeline to discover relevant repositories and extract meaningful metadata. We then propose and evaluate multiple classification strategies, including both traditional machine learning models and large language models (LLMs), to distinguish affiliated projects from unrelated repositories and generate accurate insights into the academic open source landscape. Our results show that the framework is effective at scale, discovering over 52,000 repositories and predicting institutional affiliation with high accuracy.
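
The discovery step described in the abstract can be illustrated with a minimal sketch. It assumes the pipeline queries GitHub's repository search endpoint with institution-related terms and keeps a small set of metadata fields; the abstract does not specify the endpoints, search terms, or fields actually used, so the function names (`build_search_url`, `extract_metadata`, `search_repositories`) and the selected fields are illustrative choices, not the authors' implementation.

```python
import json
import urllib.parse
import urllib.request

# GitHub's public repository search endpoint.
SEARCH_URL = "https://api.github.com/search/repositories"


def build_search_url(term, page=1, per_page=100):
    """Build a repository-search URL for one institution-related term."""
    query = urllib.parse.urlencode({"q": term, "per_page": per_page, "page": page})
    return f"{SEARCH_URL}?{query}"


def extract_metadata(repo):
    """Keep only fields a downstream classifier might use (an assumed selection)."""
    owner = repo.get("owner") or {}
    return {
        "full_name": repo.get("full_name"),
        "description": repo.get("description") or "",
        "homepage": repo.get("homepage") or "",
        "owner_login": owner.get("login"),
        "owner_type": owner.get("type"),
        "topics": repo.get("topics") or [],
        "stars": repo.get("stargazers_count", 0),
    }


def search_repositories(term, token=None, page=1):
    """Fetch one page of search results (network call; a token raises rate limits)."""
    request = urllib.request.Request(build_search_url(term, page))
    request.add_header("Accept", "application/vnd.github+json")
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [extract_metadata(item) for item in payload.get("items", [])]
```

A call such as `search_repositories("University of California")` would return one page of candidate repositories. Many of those candidates will be unrelated to the institution, which is why the framework follows discovery with a classification step.
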
Problem

Research questions and friction points this paper is trying to address.

Identifying decentralized open source projects in institutions
Classifying institutionally affiliated repositories accurately
Analyzing academic open source projects systematically
Innovation

Methods, ideas, or system contributions that make the work stand out.

Framework for systematic open source project identification
Pipeline using GitHub REST API for repository discovery
Classification with machine learning and LLMs for affiliation
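
To make the classification task concrete, here is a deliberately simple keyword baseline. It is not the paper's method, which evaluates traditional machine learning models and LLMs. It only shows what an affiliation decision over repository metadata looks like. The marker list, the metadata field names, and the `min_signals` threshold are all assumptions made for illustration.

```python
# Hypothetical affiliation markers; a real system would need a curated list.
UC_MARKERS = (
    "university of california",
    "berkeley.edu",
    "ucla.edu",
    "ucsd.edu",
    "ucdavis.edu",
)


def affiliation_signals(meta):
    """Return the markers found in a repository's text-like metadata fields."""
    text = " ".join(
        [
            meta.get("description") or "",
            meta.get("homepage") or "",
            meta.get("owner_login") or "",
            " ".join(meta.get("topics") or []),
        ]
    ).lower()
    return [marker for marker in UC_MARKERS if marker in text]


def is_likely_affiliated(meta, min_signals=1):
    """Flag a repository as affiliated when enough markers are present."""
    return len(affiliation_signals(meta)) >= min_signals


affiliated = {
    "description": "Tools from the University of California, Berkeley",
    "homepage": "https://example.berkeley.edu",
}
unrelated = {"description": "Berkeley DB bindings for Python"}

print(is_likely_affiliated(affiliated))  # True
print(is_likely_affiliated(unrelated))   # False
```

The second example hints at why keyword rules alone are brittle: a name like "Berkeley" appears in many projects with no institutional tie. Ambiguity of this kind is one motivation for learned classifiers.
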