Confidence intervals for maximum unseen probabilities, with application to sequential sampling design

📅 2026-01-28

🤖 AI Summary

This work addresses a basic question in discovery sampling with presence-absence data: have all categories whose prevalence exceeds a given threshold been observed? The authors derive nonasymptotic, distribution-free, data-dependent upper confidence bounds on the maximum unseen probability (the highest prevalence among unobserved categories) for both bounded and unbounded category spaces. They also show that no nontrivial data-independent worst-case bound can be uniformly valid in the unbounded case. Matching lower bounds establish that the data-dependent bounds are near-optimal. The bounds are developed under a Bernoulli product (incidence) model and are then used to build sequential stopping rules with finite-sample guarantees that remain robust to contamination introducing spurious low-prevalence categories. The procedures are compared on simulated and real datasets.
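To make the inferential target concrete, the following minimal sketch simulates presence-absence data and computes the maximum unseen probability when the true prevalences are known. It is an illustration only, not code from the paper: the function names, the prevalence values, and the number of sampling units are all hypothetical, and in practice the prevalences are unknown, which is why a confidence bound is needed.

```python
import random


def simulate_incidence(prevalences, n_units, rng):
    """Simulate a Bernoulli product (incidence) model.

    Returns a list of n_units rows; row[j] is 1 if category j is
    present in that sampling unit and 0 otherwise.
    """
    return [
        [1 if rng.random() < p else 0 for p in prevalences]
        for _ in range(n_units)
    ]


def max_unseen_probability(prevalences, incidence):
    """Largest prevalence among categories absent from every unit.

    Returns 0.0 when every category has been observed at least once.
    """
    seen = [
        any(row[j] for row in incidence)
        for j in range(len(prevalences))
    ]
    unseen = [p for p, was_seen in zip(prevalences, seen) if not was_seen]
    return max(unseen, default=0.0)


# Hypothetical prevalences, chosen only for illustration.
prevalences = [0.5, 0.2, 0.05, 0.01, 0.001]
rng = random.Random(0)
incidence = simulate_incidence(prevalences, n_units=30, rng=rng)
print(max_unseen_probability(prevalences, incidence))
```

The quantity is random because it depends on which categories happen to be missed, so a useful upper bound has to hold with a stated confidence over the sampling.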

📝 Abstract

Discovery problems often require deciding whether additional sampling is needed to detect all categories whose prevalence exceeds a prespecified threshold. We study this question under a Bernoulli product (incidence) model, where categories are observed only through presence-absence across sampling units. Our inferential target is the maximum unseen probability, the largest prevalence among categories not yet observed. We develop nonasymptotic, distribution-free upper confidence bounds for this quantity in two regimes: bounded alphabets (finite and known number of categories) and unbounded alphabets (countably infinite under a mild summability condition). We characterise the limits of data-independent worst-case bounds, showing that in the unbounded regime no nontrivial data-independent procedure can be uniformly valid. We then propose data-dependent bounds in both regimes and establish matching lower bounds demonstrating their near-optimality. We compare the resulting procedures empirically on both simulated and real datasets. Finally, we use these bounds to construct sequential stopping rules with finite-sample guarantees, and demonstrate robustness to contamination that introduces spurious low-prevalence categories.
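As a point of reference for the bounded regime, a data-independent bound can be sketched with a standard union-bound argument. This is a textbook calculation, not the paper's data-dependent procedure. Under the incidence model, a category with prevalence p is missed by all n units with probability (1 - p)^n, so with K categories the chance that any category with prevalence above eps is still unseen is at most K(1 - eps)^n. Setting this equal to delta gives the bound below. The function names and the numerical inputs are hypothetical.

```python
import math


def union_bound_ucb(n_units, n_categories, delta):
    """Data-independent upper bound on the maximum unseen probability.

    Solves K * (1 - eps)**n = delta for eps, so the bound holds with
    probability at least 1 - delta for any prevalences (bounded regime).
    """
    if n_units < 1 or n_categories < 1 or not 0.0 < delta < 1.0:
        raise ValueError("need n_units >= 1, n_categories >= 1, 0 < delta < 1")
    return 1.0 - (delta / n_categories) ** (1.0 / n_units)


def units_needed(threshold, n_categories, delta):
    """Smallest n with K * (1 - threshold)**n <= delta.

    After this many units, every category with prevalence above the
    threshold has been seen with probability at least 1 - delta.
    """
    if not 0.0 < threshold < 1.0 or n_categories < 1 or not 0.0 < delta < 1.0:
        raise ValueError("need 0 < threshold < 1, n_categories >= 1, 0 < delta < 1")
    return math.ceil(math.log(delta / n_categories) / math.log(1.0 - threshold))


# Hypothetical inputs: 100 categories, 5% threshold, 95% confidence.
n = units_needed(threshold=0.05, n_categories=100, delta=0.05)
print(n, union_bound_ucb(n, n_categories=100, delta=0.05))
```

Because this bound ignores the observed data, the implied stopping time is fixed in advance and grows with K. It has no analogue when the number of categories is unbounded, consistent with the abstract's statement that no nontrivial data-independent procedure is uniformly valid in that regime.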

Problem

Research questions and friction points this paper is trying to address.

maximum unseen probability

sequential sampling

discovery problems

confidence intervals

incidence model

Innovation

Methods, ideas, or system contributions that make the work stand out.

maximum unseen probability

nonasymptotic confidence bounds

distribution-free inference