Gaussian credible intervals in Bayesian nonparametric estimation of the unseen

📅 2025-01-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper addresses the “unseen species” problem by developing Bayesian nonparametric interval estimation for the number $K_{n,m}$ of new species in large-scale sampling. It proposes a Gaussian approximation to the posterior distribution of $K_{n,m}$ under the Pitman–Yor process prior (encompassing the Dirichlet process as a special case) and establishes a posterior central limit theorem for $K_{n,m}$, so that credible intervals can be constructed without Monte Carlo sampling. On both synthetic and real-world data, the resulting intervals significantly narrow the gap between asymptotic and exact credible intervals while improving computational efficiency. Key contributions include: (i) a unified asymptotic normality theory for $K_{n,m}$ under the fully parametrized Pitman–Yor prior; (ii) a sampling-free framework for constructing credible intervals; and (iii) a scalable, computationally efficient inference procedure tailored to big-data settings.

📝 Abstract

The unseen-species problem assumes $n\geq1$ samples from a population of individuals belonging to different species, possibly infinite, and calls for estimating the number $K_{n,m}$ of hitherto unseen species that would be observed if $m\geq1$ new samples were collected from the same population. This is a long-standing problem in statistics, which has gained renewed relevance in biological and physical sciences, particularly in settings with large values of $n$ and $m$. In this paper, we adopt a Bayesian nonparametric approach to the unseen-species problem under the Pitman-Yor prior, and propose a novel methodology to derive large $m$ asymptotic credible intervals for $K_{n,m}$, for any $n\geq1$. By leveraging a Gaussian central limit theorem for the posterior distribution of $K_{n,m}$, our method improves upon competitors in two key aspects: firstly, it enables the full parameterization of the Pitman-Yor prior, including the Dirichlet prior; secondly, it avoids the need for Monte Carlo sampling, enhancing computational efficiency. We validate the proposed method on synthetic and real data, demonstrating that it improves the empirical performance of competitors by significantly narrowing the gap between asymptotic and exact credible intervals for any $m\geq1$.
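
To make the quantity $K_{n,m}$ concrete, here is a minimal Python sketch. It is not the paper's central limit theorem or its credible-interval construction. It relies only on the standard Pitman-Yor predictive rule: after $n+i$ samples containing $j$ distinct species, the next sample is a new species with probability $(\theta + j\sigma)/(\theta + n + i)$. The function names, the parameter names `theta` and `sigma`, and the moment-matched normal interval for the Dirichlet case ($\sigma = 0$) are illustrative assumptions.

```python
import math
import random


def simulate_k_new(n, k, m, theta, sigma, rng):
    """Draw one posterior sample of K_{n,m} under a Pitman-Yor prior.

    n: observed sample size, k: distinct species observed so far,
    m: number of additional samples. Assumes 0 <= sigma < 1 and theta > -sigma.
    (Hypothetical helper, used here only for illustration.)
    """
    new = 0
    for i in range(m):
        # Predictive probability that sample n + i + 1 is a new species.
        p_new = (theta + (k + new) * sigma) / (theta + n + i)
        if rng.random() < p_new:
            new += 1
    return new


def dp_moment_interval(n, m, theta, z=1.96):
    """Moment-matched normal interval for K_{n,m} in the Dirichlet case.

    With sigma = 0 the new-species indicators are independent Bernoulli
    variables with success probability theta / (theta + n + i), so the
    posterior mean and variance are exact sums. The normal interval built
    from them is an assumption of this sketch, not the paper's result.
    """
    probs = [theta / (theta + n + i) for i in range(m)]
    mean = sum(probs)
    var = sum(p * (1.0 - p) for p in probs)
    half_width = z * math.sqrt(var)
    lower = max(0.0, mean - half_width)
    upper = min(float(m), mean + half_width)
    return mean, lower, upper


if __name__ == "__main__":
    rng = random.Random(0)
    mean, lower, upper = dp_moment_interval(n=100, m=1000, theta=10.0)
    draws = [simulate_k_new(100, 20, 1000, 10.0, 0.0, rng) for _ in range(1000)]
    print("normal approximation:", mean, lower, upper)
    print("Monte Carlo mean:", sum(draws) / len(draws))
```

The simulation loop is the kind of Monte Carlo computation that a Gaussian approximation is meant to replace: its cost grows with $m$ times the number of draws, whereas the moment-based interval needs a single pass over $m$ terms.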

Problem

Research questions and friction points this paper is trying to address.

Bayesian methods

Pitman-Yor prior

unobserved species problem

Innovation

Methods, ideas, or system contributions that make the work stand out.

Pitman-Yor Prior

Bayesian Nonparametric Method

Unobserved Species Prediction
