🤖 AI Summary
This paper addresses the “unseen species” problem by developing Bayesian nonparametric interval estimation for the number $K_{n,m}$ of new species in large-scale sampling. We propose a Gaussian approximation to the posterior distribution of $K_{n,m}$ under the Pitman–Yor process prior (encompassing the Dirichlet process as a special case), and establish, for the first time, a posterior central limit theorem for $K_{n,m}$—without requiring MCMC sampling. This yields closed-form credible intervals that significantly narrow the gap between asymptotic and exact coverage on both synthetic and real-world data, achieving a favorable trade-off between coverage probability and interval width, while accelerating computation by orders of magnitude. Key contributions include: (i) a unified asymptotic normality theory for $K_{n,m}$ under the fully parametrized Pitman–Yor prior; (ii) a sampling-free framework for constructing credible intervals; and (iii) a scalable, computationally efficient inference procedure tailored to big-data settings.
📝 Abstract
The unseen-species problem assumes $ngeq1$ samples from a population of individuals belonging to different species, possibly infinite, and calls for estimating the number $K_{n,m}$ of hitherto unseen species that would be observed if $mgeq1$ new samples were collected from the same population. This is a long-standing problem in statistics, which has gained renewed relevance in biological and physical sciences, particularly in settings with large values of $n$ and $m$. In this paper, we adopt a Bayesian nonparametric approach to the unseen-species problem under the Pitman-Yor prior, and propose a novel methodology to derive large $m$ asymptotic credible intervals for $K_{n,m}$, for any $ngeq1$. By leveraging a Gaussian central limit theorem for the posterior distribution of $K_{n,m}$, our method improves upon competitors in two key aspects: firstly, it enables the full parameterization of the Pitman-Yor prior, including the Dirichlet prior; secondly, it avoids the need of Monte Carlo sampling, enhancing computational efficiency. We validate the proposed method on synthetic and real data, demonstrating that it improves the empirical performance of competitors by significantly narrowing the gap between asymptotic and exact credible intervals for any $mgeq1$.