🤖 AI Summary
This study addresses the fragmentation of research software and its associated scholarly resources, such as publications and datasets, across separate platforms. Without unified semantic links between them, reproducibility assessment and cross-domain analysis are difficult. To bridge this gap, the authors construct SemRepo, a large-scale RDF knowledge graph of over 81 million triples that integrates nearly 200,000 GitHub repositories with the external scholarly knowledge graphs SemOpenAlex, LPWC, and MLSea-KG. The graph models software alongside scholarly entities such as authors, papers, and datasets. It supports provenance reconstruction across repositories and publications as well as complex semantic queries, and it enables the systematic identification of risks to research reproducibility and software sustainability.
📝 Abstract
We present SemRepo, an RDF knowledge graph comprising over 81 million triples describing nearly 200,000 GitHub repositories associated with scientific research. SemRepo captures repository-level metadata, such as contributors, issues, and programming languages, and interlinks this information with external scholarly knowledge graphs. In particular, repository authors are linked to their profiles in SemOpenAlex, repositories are connected to scholarly publications in LPWC, and research artifacts, such as datasets and experiments, are linked via MLSea-KG. This integration enables queries that span publications and their scholarly artifacts, which are typically fragmented across separate platforms. SemRepo supports analyses that are difficult to perform with existing resources in isolation, including provenance reconstruction across repositories and publications, as well as the systematic identification of risks to research reproducibility and software sustainability. By unifying research software with its scholarly context in a single graph, SemRepo provides an important infrastructure for large-scale analysis of software within the broader scientific research ecosystem.
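To illustrate the kind of cross-graph traversal the abstract describes, the following is a minimal sketch over a toy in-memory triple list. Everything in it is an assumption made for illustration: the `repo:`, `paper:`, `dataset:`, and `author:` identifiers, the `ex:` predicates, and the helper functions are hypothetical and do not reflect SemRepo's actual vocabulary or data. It shows only the general idea of following links from a repository to a publication to a dataset, and of flagging repositories that lack a linked publication.

```python
from collections import defaultdict

# Toy triples. All identifiers and "ex:" predicates are hypothetical,
# chosen for illustration; they are not SemRepo's actual schema.
TRIPLES = [
    ("repo:r1", "ex:hasContributor", "author:a1"),
    ("repo:r1", "ex:describedBy", "paper:p1"),
    ("paper:p1", "ex:usesDataset", "dataset:d1"),
    ("repo:r2", "ex:hasContributor", "author:a1"),
    ("repo:r2", "ex:hasContributor", "author:a2"),
]


def build_index(triples):
    """Index objects by (subject, predicate) for simple lookups."""
    index = defaultdict(list)
    for subject, predicate, obj in triples:
        index[(subject, predicate)].append(obj)
    return index


def datasets_for_repo(index, repo):
    """Follow repo -> paper -> dataset links across the merged graph."""
    datasets = []
    for paper in index.get((repo, "ex:describedBy"), []):
        datasets.extend(index.get((paper, "ex:usesDataset"), []))
    return datasets


def repos_without_paper(triples, index):
    """List repositories that have no linked publication."""
    repos = sorted({s for s, _, _ in triples if s.startswith("repo:")})
    return [r for r in repos if not index.get((r, "ex:describedBy"))]


index = build_index(TRIPLES)
linked_datasets = datasets_for_repo(index, "repo:r1")
unlinked_repos = repos_without_paper(TRIPLES, index)
```

In a real setting the same traversal would more likely be written as a SPARQL query against the published graph, using whatever classes and properties SemRepo actually defines.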