Learning with Positive and Imperfect Unlabeled Data

📅 2025-04-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper studies the Positive-Imperfect Unlabeled (PIU) learning problem: binary classification from positive examples and unlabeled data drawn from a shifted distribution—violating the covariate shift invariance assumption. We establish the first sample complexity characterization for PIU learning and propose the first efficiently implementable, provably consistent algorithm achieving ε-classification error. Methodologically, our approach integrates Massart noise modeling, robust design against distributional shift, polynomial ℓ₁-truncated set estimation, and learnability analysis of exponential-family parameters. Key theoretical contributions include: (i) circumventing classical impossibility results to enable nontrivial concept class learning from positives only; (ii) achieving optimal statistical and computational efficiency; and (iii) extending PIU learning to novel paradigms—including smooth distribution learning, identification among multiple candidate distributions, and unknown truncation estimation and detection—significantly improving upon recent FOCS’24 and STOC’24 results.

Technology Category

Application Category

📝 Abstract
We study the problem of learning binary classifiers from positive and unlabeled data when the unlabeled data distribution is shifted, which we call Positive and Imperfect Unlabeled (PIU) Learning. In the absence of covariate shifts, i.e., with perfect unlabeled data, Denis (1998) reduced this problem to learning under Massart noise; however, that reduction fails under even slight shifts. Our main results on PIU learning are the characterizations of the sample complexity of PIU learning and a computationally and sample-efficient algorithm achieving a misclassification error $varepsilon$. We further show that our results lead to new algorithms for several related problems. 1. Learning from smooth distributions: We give algorithms that learn interesting concept classes from only positive samples under smooth feature distributions, bypassing known existing impossibility results and contributing to recent advances in smoothened learning (Haghtalab et al, J.ACM'24) (Chandrasekaran et al., COLT'24). 2. Learning with a list of unlabeled distributions: We design new algorithms that apply to a broad class of concept classes under the assumption that we are given a list of unlabeled distributions, one of which--unknown to the learner--is $O(1)$-close to the true feature distribution. 3. Estimation in the presence of unknown truncation: We give the first polynomial sample and time algorithm for estimating the parameters of an exponential family distribution from samples truncated to an unknown set approximable by polynomials in $L_1$-norm. This improves the algorithm by Lee et al. (FOCS'24) that requires approximation in $L_2$-norm. 4. Detecting truncation: We present new algorithms for detecting whether given samples have been truncated (or not) for a broad class of non-product distributions, including non-product distributions, improving the algorithm by De et al. (STOC'24).
Problem

Research questions and friction points this paper is trying to address.

Learning binary classifiers from shifted unlabeled data
Developing efficient algorithms for misclassification error reduction
Addressing related problems like smooth distribution learning and truncation detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learning binary classifiers with shifted unlabeled data
Algorithms for smooth distributions and list learning
Polynomial sample-time for truncated exponential families
🔎 Similar Papers
No similar papers found.