Modeling Sampling Workflows for Code Repositories

📅 2026-01-27
🤖 AI Summary
This work addresses the frequent neglect of sampling strategy design and generalizability in software engineering research, which often undermines the representativeness of empirical findings. To remedy this, the paper introduces a domain-specific language (DSL) that explicitly models complex sampling workflows over code repositories through composable sampling operators, enabling formal specification of sampling strategies and reasoning about their generalizability. Implemented as a fluent Python API, the DSL is integrated with statistical indicators to quantitatively assess the representativeness of sampled datasets. The authors demonstrate the expressiveness and practical utility of their approach by reconstructing and formalizing the sampling procedures from multiple Mining Software Repositories (MSR) studies, showing that the framework can capture the sampling strategies reported in recent literature.

📝 Abstract
Empirical software engineering research often depends on datasets of code repository artifacts, where sampling strategies are employed to enable large-scale analyses. The design and evaluation of these strategies are critical, as they directly influence the generalizability of research findings. However, sampling remains an underestimated aspect of software engineering research: we identify two main challenges related to (1) the design and representativeness of sampling approaches, and (2) the ability to reason about the implications of sampling decisions on generalizability. To address these challenges, we propose a Domain-Specific Language (DSL) to explicitly describe complex sampling strategies through composable sampling operators. This formalism supports both specifying sampling strategies and reasoning about the generalizability of the results obtained with them. We implement the DSL as a Python-based fluent API, and demonstrate how it facilitates representativeness reasoning using statistical indicators extracted from sampling workflows. We validate our approach through a case study of MSR papers involving code repository sampling. Our results show that the DSL can model the sampling strategies reported in recent literature.
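To make the idea of composable sampling operators behind a fluent Python API more concrete, here is a minimal sketch. It is not the authors' DSL: the `Workflow` class, its `filter`, `random_sample`, and `run` methods, the `Repo` record, and the synthetic population are all hypothetical names and data invented for illustration. The comparison of mean star counts is only a crude stand-in for the statistical indicators the abstract mentions.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Repo:
    """Hypothetical repository record; the fields are illustrative only."""
    name: str
    stars: int
    language: str


class Workflow:
    """Hypothetical fluent builder: each method appends one sampling operator."""

    def __init__(self, steps=None):
        self._steps = list(steps or [])

    def _then(self, label, op):
        # Return a new Workflow so partial pipelines can be reused safely.
        return Workflow(self._steps + [(label, op)])

    def filter(self, predicate, label="filter"):
        return self._then(label, lambda repos, rng: [r for r in repos if predicate(r)])

    def random_sample(self, n):
        # Simple random sampling without replacement, capped at the input size.
        return self._then(
            f"random({n})",
            lambda repos, rng: rng.sample(repos, min(n, len(repos))),
        )

    def run(self, population, seed=0):
        # Apply operators in order and record the set size after each step,
        # so the workflow itself documents how the sample was obtained.
        rng = random.Random(seed)
        current = list(population)
        trace = [("population", len(current))]
        for label, op in self._steps:
            current = op(current, rng)
            trace.append((label, len(current)))
        return current, trace


def mean_stars(repos):
    """Mean star count, or 0.0 for an empty set."""
    return sum(r.stars for r in repos) / len(repos) if repos else 0.0


# Synthetic population (assumption: real studies would query a repository index).
population = [
    Repo(f"repo{i}", stars=i * 10, language="Python" if i % 2 == 0 else "Java")
    for i in range(100)
]

workflow = (
    Workflow()
    .filter(lambda r: r.language == "Python", label="python-only")
    .filter(lambda r: r.stars >= 100, label="stars>=100")
    .random_sample(20)
)

sample, trace = workflow.run(population, seed=42)

for label, size in trace:
    print(f"{label}: {size} repositories")

# Crude indicator: how far the sample mean drifts from the population mean.
print("population mean stars:", mean_stars(population))
print("sample mean stars:", mean_stars(sample))
```

Because every operator is recorded in the trace, the same object that produces the sample also states which filters and draws separate it from the original population, which is the kind of information needed to reason about what the sample can and cannot generalize to.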
Problem

Research questions and friction points this paper is trying to address.

sampling
code repositories
generalizability
empirical software engineering
representativeness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Domain-Specific Language
Sampling Strategy
Code Repository
Generalizability
Fluent API
Romain Lefeuvre
University of Rennes, Inria, CNRS, IRISA
Maïwenn Le Goasteller
University of Rennes, Inria, CNRS, IRISA
Jessie Galasso
McGill University
Benoît Combemale
Inria, University of Rennes, CNRS, IRISA
Quentin Perez
INSA Rennes, University of Rennes, Inria, CNRS, IRISA
Houari Sahraoui
Professor of Computer Science, Université de Montréal
Software Engineering, Artificial Intelligence, Automated software engineering, MDE, SBSE