Besting Good–Turing: Optimality of Non-Parametric Maximum Likelihood for Distribution Estimation

📅 2025-09-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
In large-scale discrete distribution estimation from small samples, the Good–Turing estimator relies on manually tuned smoothing parameters and lacks a precise optimality guarantee. Method: We propose a fully automatic, smoothing-parameter-free empirical Bayes method that combines Robbins's empirical Bayes framework with the Kiefer–Wolfowitz non-parametric maximum likelihood estimator (NPMLE). Fit with the EM algorithm, it learns an implicit prior from the data and converts it into probability estimates. Contribution/Results: Within the instance-optimality framework of Orlitsky and Suresh, we establish for the first time that the estimator attains the optimal instance-wise risk up to logarithmic factors, while Good–Turing is strictly suboptimal in the same framework. Experiments on synthetic data, English corpora, and U.S. Census data show consistent and significant improvements over Good–Turing and explicit Bayes procedures, supporting both the theory and the practical efficacy of the method.
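To make the two-step pipeline concrete, here is a minimal sketch of the prior-learning step. It assumes the common Poissonization device (each symbol's count modeled as Poisson with mean n·p) and approximates the Kiefer–Wolfowitz NPMLE by running EM over a fixed grid of candidate means; the function name npmle_em, the grid, and the iteration count are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.stats import poisson

def npmle_em(counts, grid, n_iter=500):
    """Grid-based approximation to the Kiefer-Wolfowitz NPMLE for a
    Poisson mixture, fit by EM.

    counts : observed count of each symbol (length-k array)
    grid   : candidate Poisson means lambda_1..lambda_m (length-m array)
    Returns weights w over the grid, approximating the mixing distribution G.
    """
    counts = np.asarray(counts)
    grid = np.asarray(grid, dtype=float)
    # Likelihood matrix: L[j, t] = P(X = counts[j]) when X ~ Pois(grid[t])
    L = poisson.pmf(counts[:, None], grid[None, :])
    w = np.full(len(grid), 1.0 / len(grid))      # start from a uniform prior
    for _ in range(n_iter):
        post = L * w                             # joint weight of (symbol, atom)
        post /= post.sum(axis=1, keepdims=True)  # E-step: posterior over atoms
        w = post.mean(axis=0)                    # M-step: reweight the atoms
    return w
```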

📝 Abstract
When faced with a small sample from a large universe of possible outcomes, scientists often turn to the venerable Good–Turing estimator. Despite its pedigree, however, this estimator comes with considerable drawbacks, such as the need to hand-tune smoothing parameters and the lack of a precise optimality guarantee. We introduce a parameter-free estimator that bests Good–Turing in both theory and practice. Our method marries two classic ideas, namely Robbins's empirical Bayes and Kiefer–Wolfowitz non-parametric maximum likelihood estimation (NPMLE), to learn an implicit prior from data and then convert it into probability estimates. We prove that the resulting estimator attains the optimal instance-wise risk up to logarithmic factors in the competitive framework of Orlitsky and Suresh, and that the Good–Turing estimator is strictly suboptimal in the same framework. Our simulations on synthetic data and experiments with English corpora and U.S. Census data show that our estimator consistently outperforms both the Good–Turing estimator and explicit Bayes procedures.
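The conversion from the learned prior to probability estimates is, under the same Poissonized-model assumption as the sketch above, an empirical Bayes posterior mean: a symbol observed x times is assigned E_G[λ | X = x] / n. A minimal sketch (eb_probability is a hypothetical name, reusing the grid and weights from npmle_em above):

```python
import numpy as np
from scipy.stats import poisson

def eb_probability(x, grid, w, n):
    """Empirical Bayes probability estimate for a symbol seen x times.

    Under the Poissonized model X ~ Pois(n * p), with the learned grid
    weights w standing in for the prior on n * p, the estimate of p is
    the posterior mean E[lambda | X = x] divided by n.
    """
    lik = poisson.pmf(x, grid)   # P(X = x | lambda) at each grid atom
    post = w * lik               # unnormalized posterior over the grid
    return float((post @ grid) / (n * post.sum()))
```

Because the same posterior mean is applied to every symbol with the same count, the estimator depends on the data only through the count profile, in the spirit of Robbins's empirical Bayes.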
Problem

Research questions and friction points this paper is trying to address.

Estimating distributions over large universes from small samples
Overcoming Good–Turing's hand-tuned smoothing and lack of optimality guarantees
Developing a parameter-free, instance-optimal estimator via empirical Bayes and NPMLE
Innovation

Methods, ideas, or system contributions that make the work stand out.

Kiefer–Wolfowitz non-parametric maximum likelihood estimation (NPMLE), fit by EM
Empirical Bayes approach that learns an implicit prior from the data
Parameter-free estimator that outperforms Good–Turing in theory and practice (see the end-to-end sketch below)
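A hypothetical end-to-end run of the two sketches above on synthetic data; the universe size, sample size, grid, and final renormalization are illustrative choices, not the paper's experimental protocol.

```python
import numpy as np

rng = np.random.default_rng(0)
true_p = rng.dirichlet(np.full(1000, 0.1))   # skewed law on a large universe
n = 200
counts = rng.multinomial(n, true_p)          # small sample: many zero counts

grid = np.geomspace(1e-3, counts.max() + 1.0, 100)  # candidate Poisson means
w = npmle_em(counts, grid)                           # learn the implicit prior
p_hat = np.array([eb_probability(x, grid, w, n) for x in counts])
p_hat /= p_hat.sum()                                 # renormalize to a distribution
```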
👥 Authors

Yanjun Han
Assistant Professor, New York University
statistics, learning theory, information theory
Jonathan Niles-Weed
New York University
statistics, probability, mathematics of data science, optimal transport
Yandi Shen
Department of Statistics and Data Science, Carnegie Mellon University
Yihong Wu
Department of Statistics and Data Science, Yale University