🤖 AI Summary
Problem: When estimating a large discrete distribution from a small sample, the classic Good-Turing estimator relies on manually tuned smoothing parameters and lacks a precise optimality guarantee.
Method: A fully automatic, smoothing-parameter-free estimator that combines Robbins's empirical Bayes framework with the Kiefer–Wolfowitz nonparametric maximum likelihood estimator (NPMLE): an EM algorithm learns an implicit prior from the data, which is then converted into probability estimates.
Contribution/Results: Within the competitive (instance-optimality) framework of Orlitsky and Suresh, we prove that our estimator attains the optimal instance-wise risk up to logarithmic factors, whereas the Good-Turing estimator is strictly suboptimal in the same framework. Experiments on synthetic data, English corpora, and U.S. Census data show consistent, significant improvements over both Good-Turing and explicit Bayes procedures.
📝 Abstract
When faced with a small sample from a large universe of possible outcomes, scientists often turn to the venerable Good–Turing estimator. Despite its pedigree, however, this estimator comes with considerable drawbacks, such as the need to hand-tune smoothing parameters and the lack of a precise optimality guarantee. We introduce a parameter-free estimator that bests Good–Turing in both theory and practice. Our method marries two classic ideas, namely Robbins's empirical Bayes and Kiefer–Wolfowitz non-parametric maximum likelihood estimation (NPMLE), to learn an implicit prior from data and then convert it into probability estimates. We prove that the resulting estimator attains the optimal instance-wise risk up to logarithmic factors in the competitive framework of Orlitsky and Suresh, and that the Good–Turing estimator is strictly suboptimal in the same framework. Our simulations on synthetic data and experiments with English corpora and U.S. Census data show that our estimator consistently outperforms both the Good–Turing estimator and explicit Bayes procedures.
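The pipeline described above (NPMLE prior fit by EM, then empirical Bayes probability estimates) can be sketched in a minimal form. This is not the paper's implementation: it assumes Poissonized symbol counts, a fixed grid of candidate means, and a plain fixed-point EM; the function names (`npmle_em`, `eb_probabilities`) and all parameter choices are illustrative.

```python
import numpy as np
from scipy.stats import poisson

def npmle_em(counts, grid, n_iter=500):
    """Kiefer-Wolfowitz NPMLE over a fixed grid, fit by EM.

    counts : observed count of each symbol (Poissonized model X_i ~ Poisson(lam_i))
    grid   : candidate values for the latent Poisson means lam
    Returns the learned prior weights over the grid points.
    """
    # Likelihood matrix: L[i, j] = P(counts[i] | lam = grid[j])
    L = poisson.pmf(counts[:, None], grid[None, :])
    w = np.full(len(grid), 1.0 / len(grid))  # uniform initial prior
    for _ in range(n_iter):
        # E-step: posterior responsibility of each grid point for each symbol
        R = L * w
        R /= R.sum(axis=1, keepdims=True)
        # M-step: new prior weights are the average responsibilities
        w = R.mean(axis=0)
    return w

def eb_probabilities(counts, grid, w):
    """Robbins-style empirical Bayes estimates under the learned prior.

    Each symbol's estimate is the posterior mean E[lam | X = count],
    normalized so the estimates form a probability distribution.
    """
    L = poisson.pmf(counts[:, None], grid[None, :])
    post = L * w
    post_mean = (post * grid).sum(axis=1) / post.sum(axis=1)
    return post_mean / post_mean.sum()
```

A quick usage sketch: fit the prior on the observed counts, then read off the per-symbol probabilities. Because the prior is learned from the data itself, there is no smoothing parameter to tune, which is the contrast with Good–Turing that the abstract emphasizes.

```python
counts = np.array([0, 1, 1, 2, 5])
grid = np.linspace(0.01, 10.0, 100)
w = npmle_em(counts, grid)
p = eb_probabilities(counts, grid, w)  # sums to 1; increases with the count
```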