🤖 AI Summary
This paper challenges the reliance of machine learning research, including its social-science applications, on abstract data-generating distributions, arguing that such assumptions lack empirical grounding in finite-population settings and create interpretability and reproducibility problems. Method: The authors advocate replacing distributional assumptions with finite-population modeling, developing five core arguments grounded in statistical foundations, philosophical epistemology, and empirical ML practice. They reconstruct the premises of learning theory by explicitly identifying the implicit assumptions and boundary conditions underlying distributional modeling. Contribution/Results: The proposed framework improves theoretical coherence, modeling transparency, causal traceability, and practical applicability. It offers a new paradigm and methodological foundation for sampling design, bias correction in evaluation, and reproducibility research, addressing key limitations of conventional distribution-based approaches in machine learning.
📝 Abstract
Machine learning research, like most of statistics, relies heavily on the concept of a data-generating probability distribution. The standard presumption is that, since data points are "sampled from" such a distribution, one can learn about this distribution from observed data and thus predict future data points, which are presumed to be drawn from it as well. Drawing on scholarship across disciplines, we argue here that this framework is not always a good model. Not only do such true probability distributions not exist; the framework can also be misleading, obscuring both the choices made and the goals pursued in machine learning practice. We suggest an alternative framework that focuses on finite populations rather than abstract distributions; while classical learning theory can be left almost unchanged, the shift opens new opportunities, especially for modelling sampling. We compile these considerations into five reasons for modelling machine learning -- in some settings -- with finite populations rather than generative distributions, both to be more faithful to practice and to provide novel theoretical insights.
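To make the contrast concrete, here is a minimal sketch (our own illustration, not from the paper; the population values and sizes are invented) of one way the finite-population view changes standard reasoning: under simple random sampling without replacement from a finite population of size N, the variance of the sample mean carries a finite-population correction factor (1 - n/N), which has no counterpart in the i.i.d.-from-a-distribution framing.

```python
import random
import statistics

# Hypothetical finite population: N recorded outcomes. In the
# finite-population view, this list itself is the object of inference,
# not a window onto an assumed generative distribution.
random.seed(0)
population = [random.gauss(50, 10) for _ in range(10_000)]
N = len(population)
n = 100

# Simple random sampling WITHOUT replacement: the natural sampling
# model when the population is finite and known to be so.
sample = random.sample(population, n)
estimate = statistics.mean(sample)

# Estimated variance of the sample mean:
s2 = statistics.variance(sample)       # sample variance (ddof = 1)
var_iid = s2 / n                       # i.i.d./distributional framing
var_fp = (1 - n / N) * s2 / n          # finite-population correction

print(f"estimate of population mean: {estimate:.2f}")
print(f"variance, i.i.d. framing:    {var_iid:.4f}")
print(f"variance, finite population: {var_fp:.4f}")
```

With n a small fraction of N the correction is minor, but as n approaches N the finite-population variance shrinks toward zero (a census has no sampling error), a fact the distributional framing cannot express.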