🤖 AI Summary
This study addresses the lack of consensus on how to define and delineate research software, which hinders systematic analysis and cross-study comparison of security risks in the Research Software Supply Chain (RSSC). To bridge this gap, the work proposes the first explicit classification framework tailored to the RSSC. Through a scoping review, it derives an operational definition of research software and constructs a unified taxonomy. The authors apply the taxonomy to the Research Software Encyclopedia (RSE) corpus, producing a reproducible annotation pipeline, a codebook, and an annotated dataset. By integrating OpenSSF Scorecard security signals, they demonstrate statistically significant differences in security metrics across classification clusters, supporting the case that classification-aware, stratified assessment is both necessary and practical for advancing RSSC security research.
📝 Abstract
Empirical studies of research software are hard to compare because the literature operationalizes "research software" inconsistently. Motivated by the research software supply chain (RSSC) and its security risks, we introduce an RSSC-oriented taxonomy that makes scope and operational boundaries explicit for empirical research software security studies. We conduct a targeted scoping review of recent repository mining and dataset construction studies, extracting each work's definition, inclusion criteria, unit of analysis, and identification heuristics. We synthesize these into a harmonized taxonomy and a mapping that translates prior approaches into shared taxonomy dimensions. We operationalize the taxonomy on a large community-curated corpus from the Research Software Encyclopedia (RSE), producing an annotated dataset, a labeling codebook, and a reproducible labeling pipeline. Finally, we apply OpenSSF Scorecard as a preliminary security analysis to show how repository-centric security signals differ across taxonomy-defined clusters and why taxonomy-aware stratification is necessary for interpreting RSSC security measurements.