Focused PU learning from imbalanced data

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses learning from highly imbalanced data in which positive instances are both scarce and similar to negative examples, making them hard to identify. The authors propose a focused empirical risk estimator for the positive-unlabeled (PU) learning setting that targets extreme class imbalance and hard-to-discriminate positives together, a combination that few existing PU methods address. The estimator uses both positive and unlabeled examples to train a binary classifier and is evaluated under two labeling mechanisms: SCAR (Selected Completely At Random) and SAR (Selected At Random). The authors report state-of-the-art performance on imbalanced datasets and apply the method to the real-world task of financial misstatement detection.
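The two labeling mechanisms named above can be illustrated with a small simulation. This is only a sketch on synthetic data: the function names, the 5% positive rate, the label frequency `c = 0.3`, and the logistic propensity are all assumptions chosen for illustration, not details from the paper.

```python
import numpy as np

def label_scar(y, c, rng):
    """SCAR: every positive is labeled with the same probability c."""
    return (y == 1) & (rng.random(y.shape[0]) < c)

def label_sar(y, propensity, rng):
    """SAR: each positive is labeled with its own probability e(x)."""
    return (y == 1) & (rng.random(y.shape[0]) < propensity)

rng = np.random.default_rng(0)
n = 10_000
y = (rng.random(n) < 0.05).astype(int)   # imbalanced: roughly 5% positives
x = rng.normal(loc=y, scale=1.0)         # one feature; positives are shifted by +1

# SCAR: the labeled positives are a uniform random subset of all positives.
s_scar = label_scar(y, c=0.3, rng=rng)

# SAR (hypothetical propensity): positives that look like negatives (small x)
# are labeled less often, so the labeled set is biased toward "easy" positives.
e = 1.0 / (1.0 + np.exp(-2.0 * (x - 1.0)))
s_sar = label_sar(y, e, rng)

print("labeled under SCAR:", s_scar.sum(), "| labeled under SAR:", s_sar.sum())
```

Under SAR the labeled positives are no longer representative of all positives, which is why methods that assume SCAR can degrade when the hard-to-detect positives are the ones left unlabeled.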

📝 Abstract

We propose a new method of learning from positive and unlabeled (PU) examples in highly imbalanced datasets. Many real-world problems, such as disease gene identification, targeted marketing, fraud detection, and recommender systems, are hard to address with machine learning methods because labeled data are limited. Often, the training data comprise positive and unlabeled instances, with the unlabeled set dominated by negatives but also containing some positives. While PU learning is well-studied, few methods address imbalanced settings or hard-to-detect positive examples that resemble negative ones. Our approach uses a focused empirical risk estimator, incorporating both positive and unlabeled examples to train binary classifiers. Empirical evaluations demonstrate state-of-the-art performance on imbalanced datasets under two labeling mechanisms: selected completely at random (SCAR) and selected at random (SAR). Beyond these controlled experiments, we demonstrate the value of the proposed method in the real-world application of financial misstatement detection.
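The abstract does not give the form of the focused estimator, so the sketch below shows only the standard non-negative PU risk from the PU learning literature, as background on how positive and unlabeled examples can be combined in one empirical risk. It assumes SCAR labeling and a known class prior `prior`; the sigmoid loss and all function names are illustrative choices, and this is not the authors' method.

```python
import numpy as np

def sigmoid_loss(scores, target):
    """Sigmoid loss l(z, t) = 1 / (1 + exp(t * z)) for a target t in {+1, -1}."""
    return 1.0 / (1.0 + np.exp(target * scores))

def nn_pu_risk(scores_pos, scores_unl, prior):
    """Non-negative PU risk estimate from classifier scores.

    scores_pos: scores f(x) on labeled positive examples
    scores_unl: scores f(x) on unlabeled examples
    prior:      assumed class prior P(y = +1), taken as known here
    """
    pos_as_pos = sigmoid_loss(scores_pos, +1).mean()  # positives treated as positive
    pos_as_neg = sigmoid_loss(scores_pos, -1).mean()  # positives treated as negative
    unl_as_neg = sigmoid_loss(scores_unl, -1).mean()  # unlabeled treated as negative
    # The unlabeled set mixes positives and negatives, so the positive share
    # is subtracted to estimate the risk on the negative class alone.
    neg_risk = unl_as_neg - prior * pos_as_neg
    # Clipping at zero keeps the estimated negative-class risk from going negative.
    return prior * pos_as_pos + max(0.0, neg_risk)

# Toy usage: 2 labeled positives, 100 unlabeled points of which 2 are positive.
scores_pos = np.array([5.0, 5.0])
scores_unl = np.array([-5.0] * 98 + [5.0] * 2)
good = nn_pu_risk(scores_pos, scores_unl, prior=0.02)
bad = nn_pu_risk(-scores_pos, -scores_unl, prior=0.02)  # same classifier, sign flipped
print(f"good classifier risk: {good:.4f}, flipped classifier risk: {bad:.4f}")
```

With a small prior, the positive term contributes little to the total, which is one reason plain PU risks can struggle on highly imbalanced data and why a reweighted or focused variant is of interest.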

Problem

Research questions and friction points this paper is trying to address.

PU learning

imbalanced data

positive and unlabeled

hard-to-detect positives

binary classification

Innovation

Methods, ideas, or system contributions that make the work stand out.

PU learning

imbalanced data

focused empirical risk estimator