Besting Good–Turing: Optimality of Non-Parametric Maximum Likelihood for Distribution Estimation

📅 2025-09-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
In large-scale discrete distribution estimation from small samples, the Good–Turing estimator relies on manually tuned smoothing parameters and lacks a precise optimality guarantee. Method: We propose a fully automatic, smoothing-parameter-free empirical Bayes method that combines Robbins's empirical Bayes framework with the Kiefer–Wolfowitz non-parametric maximum likelihood estimator (NPMLE). Fit with the EM algorithm, it learns an implicit prior from the data and converts it into probability estimates. Contribution/Results: Within the instance-optimality framework of Orlitsky and Suresh, we establish for the first time that the estimator attains the optimal instance-wise risk up to logarithmic factors, while Good–Turing is strictly suboptimal in the same framework. Experiments on synthetic data, English corpora, and U.S. Census data show consistent and significant improvements over Good–Turing and explicit Bayes procedures, supporting both the theory and the practical efficacy of the method.
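To make the two-step pipeline concrete, here is a minimal sketch of the prior-learning step. It assumes the common Poissonization device (each symbol's count modeled as Poisson with mean n·p) and approximates the Kiefer–Wolfowitz NPMLE by running EM over a fixed grid of candidate means; the function name npmle_em, the grid, and the iteration count are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.stats import poisson

def npmle_em(counts, grid, n_iter=500):
    """Grid-based approximation to the Kiefer-Wolfowitz NPMLE for a
    Poisson mixture, fit by EM.

    counts : observed count of each symbol (length-k array)
    grid   : candidate Poisson means lambda_1..lambda_m (length-m array)
    Returns weights w over the grid, approximating the mixing distribution G.
    """
    counts = np.asarray(counts)
    grid = np.asarray(grid, dtype=float)
    # Likelihood matrix: L[j, t] = P(X = counts[j]) when X ~ Pois(grid[t])
    L = poisson.pmf(counts[:, None], grid[None, :])
    w = np.full(len(grid), 1.0 / len(grid))      # start from a uniform prior
    for _ in range(n_iter):
        post = L * w                             # joint weight of (symbol, atom)
        post /= post.sum(axis=1, keepdims=True)  # E-step: posterior over atoms
        w = post.mean(axis=0)                    # M-step: reweight the atoms
    return w
```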

📝 Abstract
When faced with a small sample from a large universe of possible outcomes, scientists often turn to the venerable Good–Turing estimator. Despite its pedigree, however, this estimator comes with considerable drawbacks, such as the need to hand-tune smoothing parameters and the lack of a precise optimality guarantee. We introduce a parameter-free estimator that bests Good–Turing in both theory and practice. Our method marries two classic ideas, namely Robbins's empirical Bayes and Kiefer–Wolfowitz non-parametric maximum likelihood estimation (NPMLE), to learn an implicit prior from data and then convert it into probability estimates. We prove that the resulting estimator attains the optimal instance-wise risk up to logarithmic factors in the competitive framework of Orlitsky and Suresh, and that the Good–Turing estimator is strictly suboptimal in the same framework. Our simulations on synthetic data and experiments with English corpora and U.S. Census data show that our estimator consistently outperforms both the Good–Turing estimator and explicit Bayes procedures.
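The conversion from the learned prior to probability estimates is, under the same Poissonized-model assumption as the sketch above, an empirical Bayes posterior mean: a symbol observed x times is assigned E_G[λ | X = x] / n. A minimal sketch (eb_probability is a hypothetical name, reusing the grid and weights from npmle_em above):

```python
import numpy as np
from scipy.stats import poisson

def eb_probability(x, grid, w, n):
    """Empirical Bayes probability estimate for a symbol seen x times.

    Under the Poissonized model X ~ Pois(n * p), with the learned grid
    weights w standing in for the prior on n * p, the estimate of p is
    the posterior mean E[lambda | X = x] divided by n.
    """
    lik = poisson.pmf(x, grid)   # P(X = x | lambda) at each grid atom
    post = w * lik               # unnormalized posterior over the grid
    return float((post @ grid) / (n * post.sum()))
```

Because the same posterior mean is applied to every symbol with the same count, the estimator depends on the data only through the count profile, in the spirit of Robbins's empirical Bayes.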
Problem

Research questions and friction points this paper is trying to address.

Estimating distributions over large universes from small samples
Overcoming Good–Turing's hand-tuned smoothing and lack of optimality guarantees
Developing a parameter-free, instance-optimal estimator via empirical Bayes and NPMLE
Innovation

Methods, ideas, or system contributions that make the work stand out.

Kiefer–Wolfowitz non-parametric maximum likelihood estimation (NPMLE), fit by EM
Empirical Bayes approach that learns an implicit prior from the data
Parameter-free estimator that outperforms Good–Turing in theory and practice (see the end-to-end sketch below)
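A hypothetical end-to-end run of the two sketches above on synthetic data; the universe size, sample size, grid, and final renormalization are illustrative choices, not the paper's experimental protocol.

```python
import numpy as np

rng = np.random.default_rng(0)
true_p = rng.dirichlet(np.full(1000, 0.1))   # skewed law on a large universe
n = 200
counts = rng.multinomial(n, true_p)          # small sample: many zero counts

grid = np.geomspace(1e-3, counts.max() + 1.0, 100)  # candidate Poisson means
w = npmle_em(counts, grid)                           # learn the implicit prior
p_hat = np.array([eb_probability(x, grid, w, n) for x in counts])
p_hat /= p_hat.sum()                                 # renormalize to a distribution
```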
👥 Authors

Yanjun Han
Assistant Professor, New York University
statistics, learning theory, information theory
Jonathan Niles-Weed
New York University
statistics, probability, mathematics of data science, optimal transport
Yandi Shen
Department of Statistics and Data Science, Carnegie Mellon University
Yihong Wu
Department of Statistics and Data Science, Yale University