🤖 AI Summary
This work addresses the frequent neglect of sampling strategy design and generalizability in software engineering research, which often undermines the representativeness of empirical findings. To remedy this, the paper introduces a domain-specific language (DSL) that explicitly models complex sampling workflows over code repositories through composable sampling operators, supporting both the formal specification of sampling strategies and reasoning about their generalizability. Implemented as a fluent Python API, the DSL exposes statistical indicators extracted from sampling workflows, which are used to reason about the representativeness of sampled datasets. The authors demonstrate the expressiveness and practical utility of their approach through a case study that reconstructs and formalizes the sampling procedures of Mining Software Repositories (MSR) papers, showing that the DSL can model the sampling strategies reported in recent literature.
📝 Abstract
Empirical software engineering research often depends on datasets of code repository artifacts, where sampling strategies are employed to enable large-scale analyses. The design and evaluation of these strategies are critical, as they directly influence the generalizability of research findings. However, sampling remains an underestimated aspect of software engineering research: we identify two main challenges related to (1) the design and representativeness of sampling approaches, and (2) the ability to reason about the implications of sampling decisions on generalizability. To address these challenges, we propose a Domain-Specific Language (DSL) to explicitly describe complex sampling strategies through composable sampling operators. This formalism supports both specifying sampling strategies and reasoning about the generalizability of the results obtained with them. We implement the DSL as a Python-based fluent API, and demonstrate how it facilitates representativeness reasoning using statistical indicators extracted from sampling workflows. We validate our approach through a case study of MSR papers involving code repository sampling. Our results show that the DSL can model the sampling strategies reported in recent literature.
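To make the idea of composable sampling operators behind a fluent API more concrete, here is a minimal Python sketch. It is not the paper's DSL: the names `Repo`, `Sampler`, `where`, `random_sample`, and `total_variation` are hypothetical, and the total variation distance between language distributions is only one assumed example of a statistical indicator that could support representativeness reasoning.

```python
from __future__ import annotations

import random
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Repo:
    """Hypothetical repository record; the fields are illustrative only."""
    name: str
    language: str
    stars: int


Step = Callable[[list], list]


@dataclass(frozen=True)
class Sampler:
    """Immutable chain of sampling operators (hypothetical API, not the paper's)."""
    steps: tuple = ()

    def _then(self, step: Step) -> Sampler:
        # Each operator returns a new Sampler, so strategies compose and can be reused.
        return Sampler(self.steps + (step,))

    def where(self, predicate: Callable[[Repo], bool]) -> Sampler:
        """Filtering operator: keep repositories that satisfy the predicate."""
        return self._then(lambda repos: [r for r in repos if predicate(r)])

    def random_sample(self, n: int, seed: int = 0) -> Sampler:
        """Simple random sampling without replacement, seeded for reproducibility."""
        def step(repos: list) -> list:
            rng = random.Random(seed)
            return rng.sample(repos, min(n, len(repos)))
        return self._then(step)

    def run(self, population: Sequence[Repo]) -> list:
        """Apply every operator in order to the given population."""
        repos = list(population)
        for step in self.steps:
            repos = step(repos)
        return repos


def language_shares(repos: Sequence[Repo]) -> dict:
    """Relative frequency of each language in a collection of repositories."""
    if not repos:
        return {}
    counts = Counter(r.language for r in repos)
    return {lang: c / len(repos) for lang, c in counts.items()}


def total_variation(population: Sequence[Repo], sample: Sequence[Repo]) -> float:
    """Assumed indicator: total variation distance between language distributions
    (0.0 means identical shares, 1.0 means disjoint)."""
    p, q = language_shares(population), language_shares(sample)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


# Synthetic population, purely for illustration.
population = [
    Repo(f"repo{i}", "Python" if i % 2 == 0 else "Java", stars=i)
    for i in range(100)
]

# A strategy is declared once, then executed against a population.
strategy = Sampler().where(lambda r: r.stars >= 10).random_sample(20, seed=42)
sample = strategy.run(population)
print(len(sample), round(total_variation(population, sample), 3))
```

Keeping the strategy as an explicit, immutable value separate from its execution is what makes it inspectable: the same `strategy` object can be rerun, compared against another strategy, or paired with indicators such as `total_variation` to discuss how far a sample drifts from its source population.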