🤖 AI Summary
This work addresses a basic question in discovery sampling with presence–absence data: have all categories whose prevalence exceeds a given threshold been observed? The authors derive non-asymptotic, distribution-free upper confidence bounds on the maximum unseen probability (the highest prevalence among unobserved categories) for both bounded and unbounded category spaces. They show that data-independent worst-case bounds cannot be both nontrivial and uniformly valid in the unbounded case, propose data-dependent bounds for both regimes, and establish matching lower bounds showing that these bounds are near-optimal. The bounds are developed under a Bernoulli product (incidence) model and are then used to construct sequential stopping rules with finite-sample guarantees that remain robust to contamination by spurious low-prevalence categories. The resulting procedures are compared empirically on simulated and real datasets.
📝 Abstract
Discovery problems often require deciding whether additional sampling is needed to detect all categories whose prevalence exceeds a prespecified threshold. We study this question under a Bernoulli product (incidence) model, where categories are observed only through presence--absence across sampling units. Our inferential target is the \emph{maximum unseen probability}, the largest prevalence among categories not yet observed. We develop nonasymptotic, distribution-free upper confidence bounds for this quantity in two regimes: bounded alphabets (finite and known number of categories) and unbounded alphabets (countably infinite under a mild summability condition). We characterise the limits of data-independent worst-case bounds, showing that in the unbounded regime no nontrivial data-independent procedure can be uniformly valid. We then propose data-dependent bounds in both regimes and establish matching lower bounds demonstrating their near-optimality. We empirically compare the resulting procedures on both simulated and real datasets. Finally, we use these bounds to construct sequential stopping rules with finite-sample guarantees, and demonstrate robustness to contamination that introduces spurious low-prevalence categories.
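
To make the setting concrete, the following is a minimal Python sketch of the incidence model and of the inferential target. It is an illustration only and not the authors' procedure: the function names, the prevalence vector `p`, and the sample size are all hypothetical choices. The last function is the elementary union-bound argument for a known number of categories K: each category with prevalence at least eps is missed by all n independent units with probability at most (1 - eps)^n, so the maximum unseen probability is below eps with probability at least 1 - K(1 - eps)^n. This is one simple data-independent bound for the bounded regime, and it becomes vacuous as K grows, which matches the abstract's remark about the unbounded regime.

```python
import random

def simulate_incidence(p, n, rng):
    """Draw n sampling units; unit i contains category j with probability p[j]."""
    return [[rng.random() < pj for pj in p] for _ in range(n)]

def max_unseen_probability(p, X):
    """Largest prevalence among categories absent from every sampling unit.

    Returns 0.0 when every category has been observed at least once.
    Note: this needs the true prevalences p, so it is only computable in simulation.
    """
    seen = [any(row[j] for row in X) for j in range(len(p))]
    return max((pj for pj, s in zip(p, seen) if not s), default=0.0)

def union_bound_ucb(K, n, delta):
    """Data-independent bound for K known categories (illustrative assumption).

    Solves K * (1 - eps)**n = delta for eps, giving a level-(1 - delta)
    upper confidence bound on the maximum unseen probability.
    """
    return 1.0 - (delta / K) ** (1.0 / n)

if __name__ == "__main__":
    rng = random.Random(0)
    p = [0.5, 0.2, 0.05, 0.01, 0.001]  # hypothetical prevalences
    n = 50
    X = simulate_incidence(p, n, rng)
    print("max unseen probability:", max_unseen_probability(p, X))
    print("union-bound UCB:", union_bound_ucb(len(p), n, delta=0.05))
```

Because the union bound ignores the observed data, it returns the same value whether or not rare categories have already appeared; a data-dependent bound of the kind the abstract describes can instead adapt to the presence-absence matrix that was actually observed.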