Near Optimal Inference for the Best-Performing Algorithm

📅 2025-08-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the problem of reliably identifying the smallest subset of machine learning algorithms that is most likely to contain the optimal algorithm on unseen datasets, given performance observations on a limited benchmark of datasets—particularly when performance differences are marginal and high-confidence guarantees are required. The authors propose a subset selection framework grounded in multinomial statistical inference, which ensures that the true optimal algorithm lies within the selected subset with provable confidence. The method applies in both asymptotic and finite-sample regimes, and matching lower bounds are provided. Theoretical analysis and empirical evaluation demonstrate that the approach significantly improves the trade-off between subset size and confidence level compared to existing methods, enabling more compact identification of the optimal algorithm candidates.

📝 Abstract
Consider a collection of competing machine learning algorithms. Given their performance on a benchmark of datasets, we would like to identify the best-performing algorithm—specifically, the algorithm most likely to rank highest on a future, unseen dataset. A natural approach is to select the algorithm that demonstrates the best performance on the benchmark. However, in many cases the performance differences are marginal, so additional candidates may also be considered. This problem is formulated as subset selection for multinomial distributions. Formally, given a sample from a countable alphabet, our goal is to identify a minimal subset of symbols that includes the most frequent symbol in the population with high confidence. In this work, we introduce a novel framework for the subset selection problem. We provide both asymptotic and finite-sample schemes that significantly improve upon currently known methods. In addition, we provide matching lower bounds, demonstrating the favorable performance of our proposed schemes.
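To make the multinomial formulation concrete, here is a minimal baseline sketch (not the paper's scheme): each benchmark dataset contributes one "win" to the top-ranked algorithm, giving a multinomial sample over the algorithm alphabet; we then keep the empirical leader plus every candidate that a Bonferroni-corrected pairwise sign test cannot rule out as the true best. The `wins` dictionary and the `alpha` threshold are illustrative assumptions.

```python
import math

def subset_select(wins, alpha=0.05):
    """Return a subset of algorithms likely to contain the most frequent
    winner, given per-algorithm win counts over a benchmark of datasets.

    Baseline heuristic: compare each candidate head-to-head against the
    empirical leader with a one-sided exact sign test (binomial, p = 1/2),
    Bonferroni-corrected over the number of comparisons. This is a simple
    illustration of the problem, not the paper's proposed scheme.
    """
    leader = max(wins, key=wins.get)
    k = max(len(wins) - 1, 1)  # number of pairwise comparisons
    selected = {leader}
    for alg, w in wins.items():
        if alg == leader:
            continue
        m = w + wins[leader]  # head-to-head "trials" vs the leader
        # One-sided binomial tail: P(X <= w) under Binom(m, 1/2).
        tail = sum(math.comb(m, x) for x in range(w + 1)) / 2 ** m
        if tail > alpha / k:  # cannot reject alg as the true best
            selected.add(alg)
    return selected

# Hypothetical benchmark: wins over 40 datasets.
wins = {"xgboost": 18, "random_forest": 15, "svm": 5, "knn": 2}
print(sorted(subset_select(wins)))  # → ['random_forest', 'xgboost']
```

With these counts, `random_forest` trails `xgboost` only marginally and stays in the subset, while `svm` and `knn` are confidently excluded—exactly the "marginal differences" regime the abstract describes.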
Problem

Research questions and friction points this paper is trying to address.

Identify best-performing algorithm from benchmark datasets
Select minimal subset including top algorithm confidently
Improve subset selection methods with theoretical bounds
Innovation

Methods, ideas, or system contributions that make the work stand out.

Subset selection for multinomial distributions
Asymptotic and finite-sample schemes
Matching lower bounds for performance