OpenDORS: A dataset of openly referenced open research software

📅 2025-12-01
🤖 AI Summary
Problem: Large-scale empirical studies of research software and its development are still lacking. Method: The authors built a dataset of 134,352 unique open research software projects and 134,154 source code repositories referenced in open access literature. Each record identifies the referencing publication and lists the source code repositories of the software project. Contribution/Results: For 122,425 of the repositories, the dataset provides metadata on latest versions, license information, programming languages, and descriptive metadata files. The dataset is intended as an empirical basis for research software engineering (RSE) studies, for example on research software practice and development.

📝 Abstract
In many academic disciplines, software is created during the research process or for a research purpose. The crucial role of software for research is increasingly acknowledged. The application of software engineering to research software has been formalized as research software engineering, to create better software that enables better research. Despite this, large-scale studies of research software and its development are still lacking. To enable such studies, we present a dataset of 134,352 unique open research software projects and 134,154 source code repositories referenced in open access literature. Each dataset record identifies the referencing publication and lists source code repositories of the software project. For 122,425 source code repositories, the dataset provides metadata on latest versions, license information, programming languages and descriptive metadata files. We summarize the distributions of these features in the dataset and describe additional software metadata that extends the dataset in future work. Finally, we suggest examples of research that could use the dataset to develop a better understanding of research software practice in RSE research.
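The record structure described in the abstract can be pictured with a small sketch. This is an illustration only: the class names, field names, and sample values below are hypothetical and are not taken from the dataset's actual schema, which may be organized differently. It shows one record linking a referencing publication to its repositories, plus the kind of distribution summary the abstract mentions.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Repository:
    """One source code repository. All field names are hypothetical."""
    url: str
    latest_version: Optional[str] = None
    license: Optional[str] = None
    languages: List[str] = field(default_factory=list)
    metadata_files: List[str] = field(default_factory=list)


@dataclass
class SoftwareProject:
    """One dataset record: a referencing publication and its repositories."""
    referencing_publication: str
    repositories: List[Repository] = field(default_factory=list)


def license_distribution(projects: List[SoftwareProject]) -> Counter:
    """Count repositories per license, using 'unknown' when none is recorded."""
    counts: Counter = Counter()
    for project in projects:
        for repo in project.repositories:
            counts[repo.license or "unknown"] += 1
    return counts


# Placeholder data, invented for illustration; not real dataset content.
sample = [
    SoftwareProject(
        referencing_publication="doi:10.0000/example.1",
        repositories=[
            Repository(
                url="https://example.org/repo-a",
                latest_version="1.2.0",
                license="MIT",
                languages=["Python"],
                metadata_files=["CITATION.cff"],
            ),
            Repository(url="https://example.org/repo-b"),
        ],
    ),
]

print(license_distribution(sample))
```

Keeping repositories as a list on each project mirrors the abstract's statement that a record "lists source code repositories of the software project", and the optional fields reflect that metadata is available for only part of the repositories.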

Problem

Research questions and friction points this paper is trying to address.

Creating a dataset of open research software projects referenced in academic literature.
Providing metadata on software repositories for large-scale studies of research software.
Enabling research on software engineering practices in academic software development.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dataset of open research software projects
Metadata on repositories and publications
Enables large-scale studies of software practices