🤖 AI Summary
This paper challenges the reliance of machine learning research, including its social-science applications, on abstract data-generating distributions, arguing that such assumptions lack empirical grounding in finite-population settings and create interpretability and reproducibility problems. Method: The authors advocate replacing distributional assumptions with finite-population modeling, developing five core arguments grounded in statistical foundations, philosophical epistemology, and empirical ML practice. They reconstruct the premises of learning theory by explicitly identifying the implicit assumptions and boundary conditions underlying distributional modeling. Contribution/Results: The proposed framework improves theoretical coherence, modeling transparency, causal traceability, and practical applicability. It offers a new paradigm and methodological foundation for sampling design, bias correction in evaluation, and reproducibility research, addressing key limitations of conventional distribution-based approaches in machine learning.
📝 Abstract
Machine learning research, like most of statistics, relies heavily on the concept of a data-generating probability distribution. The standard presumption is that, since data points are "sampled from" such a distribution, one can learn about this distribution from observed data and thus predict future data points, which are presumed to be drawn from it as well. Drawing on scholarship across disciplines, we argue here that this framework is not always a good model. Not only do such true probability distributions not exist; the framework can also be misleading, obscuring both the choices made and the goals pursued in machine learning practice. We suggest an alternative framework that focuses on finite populations rather than abstract distributions; while classical learning theory can be left almost unchanged, the shift opens new opportunities, especially for modelling sampling. We compile these considerations into five reasons for modelling machine learning -- in some settings -- with finite populations rather than generative distributions, both to be more faithful to practice and to provide novel theoretical insights.
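To make the contrast concrete, here is a minimal sketch (our own illustration, not from the paper; the population values and sizes are invented) of one way the finite-population view changes standard reasoning: under simple random sampling without replacement from a finite population of size N, the variance of the sample mean carries a finite-population correction factor (1 - n/N), which has no counterpart in the i.i.d.-from-a-distribution framing.

```python
import random
import statistics

# Hypothetical finite population: N recorded outcomes. In the
# finite-population view, this list itself is the object of inference,
# not a window onto an assumed generative distribution.
random.seed(0)
population = [random.gauss(50, 10) for _ in range(10_000)]
N = len(population)
n = 100

# Simple random sampling WITHOUT replacement: the natural sampling
# model when the population is finite and known to be so.
sample = random.sample(population, n)
estimate = statistics.mean(sample)

# Estimated variance of the sample mean:
s2 = statistics.variance(sample)       # sample variance (ddof = 1)
var_iid = s2 / n                       # i.i.d./distributional framing
var_fp = (1 - n / N) * s2 / n          # finite-population correction

print(f"estimate of population mean: {estimate:.2f}")
print(f"variance, i.i.d. framing:    {var_iid:.4f}")
print(f"variance, finite population: {var_fp:.4f}")
```

With n a small fraction of N the correction is minor, but as n approaches N the finite-population variance shrinks toward zero (a census has no sampling error), a fact the distributional framing cannot express.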