Accounting for multiplicity in machine learning benchmark performance

📅 2023-03-10
📈 Citations: 3
Influential: 0
🤖 AI Summary
Current state-of-the-art (SOTA) evaluation in machine learning benchmarking reports the maximum test-set performance across many classifiers, which induces an optimistic bias: it estimates a sample-level extreme rather than the expected performance of the best classifier. Method: The paper brings multiplicity correction, well established in multiple testing, to SOTA estimation, providing a probability distribution for the maximum over multiple classifiers so that standard analysis methods yield a better SOTA estimate. Contribution/Results: Through theoretical analysis, simulation with independent classifiers, and three real Kaggle competitions, the authors show that the correction reduces the optimism of the conventional SOTA estimate, and that classifier dependence affects the variance of the maximum but has limited impact when accuracy is high.
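To make the multiplicity effect concrete, here is a minimal Monte Carlo sketch (not the paper's code; the values of M, n, and p are illustrative assumptions): even when M independent classifiers all share the same true accuracy p, reporting the maximum observed test-set accuracy systematically overestimates p.

```python
# Monte Carlo sketch of the multiplicity bias (illustrative parameters,
# not from the paper): M independent classifiers, all with true accuracy p,
# each scored once on the same test set of n examples.
import numpy as np

rng = np.random.default_rng(0)
M, n, p = 1000, 2000, 0.90      # classifiers, test-set size, true accuracy
n_sims = 5000                   # repetitions of the whole benchmark

# Observed accuracy of each classifier: Binomial(n, p) / n.
acc = rng.binomial(n, p, size=(n_sims, M)) / n
sota = acc.max(axis=1)          # conventional SOTA estimate: the sample maximum

print(f"true accuracy of every classifier: {p:.4f}")
print(f"mean SOTA estimate (max over M)  : {sota.mean():.4f}")
print(f"optimistic bias                  : {sota.mean() - p:.4f}")
```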
📝 Abstract
Machine learning methods are commonly evaluated and compared by their performance on data sets from public repositories. This allows multiple methods, oftentimes several thousand, to be evaluated under identical conditions and across time. The highest-ranked performance on a problem is referred to as state-of-the-art (SOTA) performance, and is used, among other things, as a reference point for publication of new methods. Using the highest-ranked performance as an estimate of SOTA is biased, giving overly optimistic results. The mechanisms at play are those of multiplicity, a topic that is well studied in the context of multiple comparisons and multiple testing, but has, as far as the authors are aware, been nearly absent from the discussion of SOTA estimates. The optimistic state-of-the-art estimate is used as a standard for evaluating new methods, and methods with substantially inferior results are easily overlooked. In this article, we provide a probability distribution for the case of multiple classifiers so that known analysis methods can be applied and a better SOTA estimate can be provided. We demonstrate the impact of multiplicity through a simulated example with independent classifiers. We show how classifier dependency impacts the variance, but also that the impact is limited when the accuracy is high. Finally, we discuss three real-world examples: Kaggle competitions that demonstrate various aspects.
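The "probability distribution for the case of multiple classifiers" mentioned in the abstract can be illustrated with standard order statistics, assuming i.i.d. classifiers for simplicity (the paper also treats non-identical, dependent ones): if each classifier gets K ~ Binomial(n, p) test items right, the maximum of M observed accuracies has CDF F(k)^M, from which the expected maximum follows.

```python
# Exact distribution of the maximum under the i.i.d. assumption.
# F is the Binomial(n, p) CDF of one classifier's correct-count K;
# the maximum over M classifiers then satisfies P(max <= k) = F(k)^M.
import numpy as np
from scipy.stats import binom

M, n, p = 1000, 2000, 0.90
k = np.arange(n + 1)

F = binom.cdf(k, n, p)                                # single-classifier CDF
pmf_max = F**M - np.concatenate(([0.0], F[:-1]**M))   # PMF of the maximum

expected_max = np.sum((k / n) * pmf_max)
print(f"E[max accuracy] over {M} classifiers: {expected_max:.4f} (true p = {p})")
```

With the same illustrative parameters, this exact expectation should agree with the Monte Carlo estimate in the sketch above.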
Problem

Research questions and friction points this paper is trying to address.

Distinguishing the expected sample maximum from the expected performance of the best classifier
Analyzing the distribution of the sample-maximum estimator for multiple classifiers
Investigating the practical consequences of current SOTA estimation practice
Innovation

Methods, ideas, or system contributions that make the work stand out.

Estimates the expected performance of the best classifier
Distinguishes the sample maximum from the best classifier
Analyzes distributions of non-identical, dependent classifiers (see the sketch below)
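The abstract's observation that dependency affects the variance of the maximum, with limited impact at high accuracy, can be probed with a hypothetical Gaussian-copula construction (an assumption for illustration, not the paper's model): a shared per-example latent factor induces correlation rho between the classifiers' correctness indicators.

```python
# Hedged sketch of dependence effects (the Gaussian-copula construction
# is an assumption for illustration, not the paper's model). A shared
# per-example latent factor z induces correlation rho between the
# classifiers' correctness indicators.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
M, n, n_sims = 50, 1000, 500    # classifiers, test-set size, simulations

def sd_of_max(p, rho):
    """Std. dev. of the maximum observed accuracy over M correlated classifiers."""
    thr = norm.ppf(p)           # each indicator is marginally Bernoulli(p)
    maxima = np.empty(n_sims)
    for s in range(n_sims):
        z = rng.standard_normal((n, 1))    # shared per-example difficulty
        e = rng.standard_normal((n, M))    # classifier-specific noise
        correct = (np.sqrt(rho) * z + np.sqrt(1 - rho) * e) < thr
        maxima[s] = correct.mean(axis=0).max()
    return maxima.std()

for p in (0.70, 0.95):
    print(f"p={p}: sd(max) at rho=0.0 -> {sd_of_max(p, 0.0):.4f}, "
          f"rho=0.5 -> {sd_of_max(p, 0.5):.4f}")
```

Qualitatively, the gap between the rho = 0 and rho = 0.5 spreads should shrink as p moves from 0.70 to 0.95, in line with the abstract's claim that the impact of dependence is limited when accuracy is high.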
Kajsa Møllersen
Department of Community Medicine, UiT - The Arctic University of Norway
Einar Holsbø
Department of Computer Science, UiT - The Arctic University of Norway