🤖 AI Summary
Problem: The generalizability of machine learning experimental results (i.e., their consistency across varying conditions) has long lacked rigorous, quantitative assessment, probably because experimental procedures have not been mathematically formalized.
Method: This paper introduces the first principled mathematical modeling framework for ML experiments, treating them as stochastic processes and defining computable, reproducible generalizability metrics. The approach integrates probabilistic modeling, statistical inference, and experimental design theory to enable both diagnostic analysis of generalizability and estimation of minimal required sample sizes.
Contribution/Results: Applied to two recently published benchmarks, the framework distinguishes generalizable from non-generalizable results. An accompanying Python module allows the analysis to be repeated for other experimental studies. The work offers a quantitative basis for assessing how far the conclusions of an ML experimental study can be expected to hold.
📝 Abstract
Experimental studies are a cornerstone of machine learning (ML) research. A common, but often implicit, assumption is that the results of a study will generalize beyond the study itself, e.g., to new data. That is, there is a high probability that repeating the study under different conditions will yield similar results. Despite the importance of the concept, the problem of measuring generalizability remains open. This is probably due to the lack of a mathematical formalization of experimental studies. In this paper, we propose such a formalization and develop a quantifiable notion of generalizability. This notion allows us to explore the generalizability of existing studies and to estimate the number of experiments needed to achieve generalizability in new studies. To demonstrate its usefulness, we apply it to two recently published benchmarks to discern generalizable and non-generalizable results. We also publish a Python module that allows our analysis to be repeated for other experimental studies.
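To make the idea of "a high probability that repeating the study under different conditions will yield similar results" concrete, the following is a minimal sketch, not the paper's formalization and not the API of the published module. It assumes that a study's result is a ranking of methods by mean score, that two results count as "similar" when their Kendall's tau reaches a chosen threshold, and that experimental conditions can be resampled from a pool of already observed ones. All function names, the threshold, and the synthetic data are hypothetical.

```python
import random
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score lists over the same methods."""
    pairs = list(combinations(range(len(scores_a)), 2))
    total = 0
    for i, j in pairs:
        sign_a = (scores_a[i] > scores_a[j]) - (scores_a[i] < scores_a[j])
        sign_b = (scores_b[i] > scores_b[j]) - (scores_b[i] < scores_b[j])
        total += sign_a * sign_b  # +1 concordant, -1 discordant, 0 if tied
    return total / len(pairs)


def mean_scores(results, condition_ids):
    """Average each method's score over the selected experimental conditions."""
    n_methods = len(results[0])
    return [
        sum(results[c][m] for c in condition_ids) / len(condition_ids)
        for m in range(n_methods)
    ]


def estimate_generalizability(results, n, threshold=0.8, n_trials=500, seed=0):
    """Monte Carlo estimate of P(two disjoint n-condition studies agree)."""
    rng = random.Random(seed)
    agreements = 0
    for _ in range(n_trials):
        # Draw 2n distinct conditions and split them into two disjoint studies.
        drawn = rng.sample(range(len(results)), 2 * n)
        study_a = mean_scores(results, drawn[:n])
        study_b = mean_scores(results, drawn[n:])
        if kendall_tau(study_a, study_b) >= threshold:
            agreements += 1
    return agreements / n_trials


def smallest_sufficient_n(results, target=0.95, **kwargs):
    """Smallest n whose estimated agreement probability reaches target, else None."""
    for n in range(1, len(results) // 2 + 1):
        if estimate_generalizability(results, n, **kwargs) >= target:
            return n
    return None


if __name__ == "__main__":
    # Synthetic data (an assumption): 5 methods with fixed quality plus noise,
    # evaluated under 40 experimental conditions; results[c][m] is the score
    # of method m under condition c.
    rng = random.Random(1)
    quality = [0.60, 0.65, 0.70, 0.75, 0.80]
    results = [[q + rng.gauss(0, 0.05) for q in quality] for _ in range(40)]
    for n in (1, 5, 20):
        print(n, estimate_generalizability(results, n))
    print("smallest n:", smallest_sufficient_n(results, target=0.95))
```

In this sketch, `estimate_generalizability` plays the role of exploring an existing study, while `smallest_sufficient_n` mirrors the question of how many experiments a new study would need. Both depend on the chosen similarity measure and threshold, which is why any such notion needs to be fixed explicitly before results can be called generalizable.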