🤖 AI Summary
This work investigates the statistical complexity of Positive-Unlabeled (PU) learning when the class prior is unknown. Unlike mainstream approaches that assume a known class prior or strong sampling assumptions, we analyze PU learning in the more realistic setting where the positive class prior is not available to the learner. We establish, for the first time, tight upper and lower sample complexity bounds on the minimal numbers of positive and unlabeled samples required. Leveraging statistical learning theory and empirical process techniques, we rigorously derive these bounds and quantify their dependence on the (unknown) class prior, classifier complexity, and distribution shift. Our results demonstrate that PU learning remains statistically learnable even without prior knowledge, with sample requirements exceeding those of supervised learning by only a logarithmic factor. This work provides the first prior-free theoretical foundation for PU learning, substantially broadening its applicability and offering practical guidance in real-world scenarios such as medical screening and anomaly detection.
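To give a rough sense of what a "logarithmic factor" overhead over supervised learning means, the sketch below pairs the standard agnostic PAC sample complexity for a hypothesis class of VC dimension d (a known textbook result) with a purely schematic PU-style bound carrying a log-factor overhead. The symbols m_P and m_U (positive and unlabeled sample sizes), the argument of the logarithm, and the exact constants are illustrative assumptions, not the paper's stated theorems.

```latex
% Illustrative sketch only. The first expression is the classical agnostic PAC
% sample complexity for a VC class of dimension d; the second shows the generic
% shape of a "supervised cost times a logarithmic factor" bound of the kind
% described above. The exact form of the paper's bounds may differ.
\[
  m_{\mathrm{sup}}(\epsilon,\delta)
    \;=\; \Theta\!\left(\frac{d + \log(1/\delta)}{\epsilon^{2}}\right),
  \qquad
  m_{P},\, m_{U}
    \;=\; O\!\left(\frac{d + \log(1/\delta)}{\epsilon^{2}} \cdot \log\frac{1}{\epsilon}\right).
\]
```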
📝 Abstract
PU (Positive-Unlabeled) learning is a variant of supervised classification in which the only labels revealed to the learner are those of positively labeled instances. PU learning arises in many real-world applications. Most existing work relies on the simplifying assumptions that the positively labeled training data is drawn from the restriction of the data-generating distribution to positively labeled instances and/or that the proportion of positively labeled points (a.k.a. the class prior) is known a priori to the learner. This paper provides a theoretical analysis of the statistical complexity of PU learning under a wider range of setups. Unlike most prior work, our study does not assume that the class prior is known to the learner. We prove upper and lower bounds on the required sample sizes (of both the positively labeled and the unlabeled samples).