Near Optimal Inference for the Best-Performing Algorithm

📅 2025-08-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the problem of reliably identifying the smallest subset of machine learning algorithms that is most likely to contain the optimal algorithm on unseen datasets, given performance observations on a limited benchmark of datasets—particularly when performance differences are marginal and high-confidence guarantees are required. The authors propose a subset selection framework grounded in multinomial statistical inference, which ensures that the true optimal algorithm lies within the selected subset with provable confidence. The method applies in both asymptotic and finite-sample regimes, and matching lower bounds are provided. Theoretical analysis and empirical evaluation demonstrate that the approach significantly improves the trade-off between subset size and confidence level compared to existing methods, enabling more compact identification of the optimal algorithm candidates.

📝 Abstract
Consider a collection of competing machine learning algorithms. Given their performance on a benchmark of datasets, we would like to identify the best-performing algorithm—specifically, the algorithm most likely to rank highest on a future, unseen dataset. A natural approach is to select the algorithm that demonstrates the best performance on the benchmark. However, in many cases the performance differences are marginal, so additional candidates may also be considered. This problem is formulated as subset selection for multinomial distributions. Formally, given a sample from a countable alphabet, our goal is to identify a minimal subset of symbols that includes the most frequent symbol in the population with high confidence. In this work, we introduce a novel framework for the subset selection problem. We provide both asymptotic and finite-sample schemes that significantly improve upon currently known methods. In addition, we provide matching lower bounds, demonstrating the favorable performance of our proposed schemes.
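To make the multinomial formulation concrete, here is a minimal baseline sketch (not the paper's scheme): each benchmark dataset contributes one "win" to the top-ranked algorithm, giving a multinomial sample over the algorithm alphabet; we then keep the empirical leader plus every candidate that a Bonferroni-corrected pairwise sign test cannot rule out as the true best. The `wins` dictionary and the `alpha` threshold are illustrative assumptions.

```python
import math

def subset_select(wins, alpha=0.05):
    """Return a subset of algorithms likely to contain the most frequent
    winner, given per-algorithm win counts over a benchmark of datasets.

    Baseline heuristic: compare each candidate head-to-head against the
    empirical leader with a one-sided exact sign test (binomial, p = 1/2),
    Bonferroni-corrected over the number of comparisons. This is a simple
    illustration of the problem, not the paper's proposed scheme.
    """
    leader = max(wins, key=wins.get)
    k = max(len(wins) - 1, 1)  # number of pairwise comparisons
    selected = {leader}
    for alg, w in wins.items():
        if alg == leader:
            continue
        m = w + wins[leader]  # head-to-head "trials" vs the leader
        # One-sided binomial tail: P(X <= w) under Binom(m, 1/2).
        tail = sum(math.comb(m, x) for x in range(w + 1)) / 2 ** m
        if tail > alpha / k:  # cannot reject alg as the true best
            selected.add(alg)
    return selected

# Hypothetical benchmark: wins over 40 datasets.
wins = {"xgboost": 18, "random_forest": 15, "svm": 5, "knn": 2}
print(sorted(subset_select(wins)))  # → ['random_forest', 'xgboost']
```

With these counts, `random_forest` trails `xgboost` only marginally and stays in the subset, while `svm` and `knn` are confidently excluded—exactly the "marginal differences" regime the abstract describes.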
Problem

Research questions and friction points this paper is trying to address.

Identify best-performing algorithm from benchmark datasets
Select minimal subset including top algorithm confidently
Improve subset selection methods with theoretical bounds
Innovation

Methods, ideas, or system contributions that make the work stand out.

Subset selection for multinomial distributions
Asymptotic and finite-sample schemes
Matching lower bounds for performance