DIRECT: Deep Active Learning under Imbalance and Label Noise

📅 2023-12-14
🏛️ arXiv.org
📈 Citations: 9
Influential: 2
📄 PDF
🤖 AI Summary
To address the sample selection bias that arises when class imbalance and label noise coexist, this paper proposes DIRECT, a robust threshold-driven deep active learning algorithm. In the first study to jointly tackle both issues in active learning, the method embeds examples with a deep network, reduces sample selection to a one-dimensional problem, and robustly estimates the inter-class separation threshold. Unlabeled examples closest to this threshold, i.e., the most uncertain ones near the decision boundary, are prioritized for labeling. The reduction to one-dimensional active learning lets the framework draw on classic results to support batch querying and tolerate label noise. On multiple imbalanced benchmark datasets, DIRECT saves over 60% of the annotation budget compared to state-of-the-art active learning methods and over 80% compared to random sampling, while markedly improving minority-class recognition.
📝 Abstract
Class imbalance is a prevalent issue in real-world machine learning applications, often leading to poor performance in rare and minority classes. With an abundance of wild unlabeled data, active learning is perhaps the most effective technique in solving the problem at its root -- collecting a more balanced and informative set of labeled examples during annotation. Label noise is another common issue in data annotation jobs, which is especially challenging for active learning methods. In this work, we conduct the first study of active learning under both class imbalance and label noise. We propose a novel algorithm that robustly identifies the class separation threshold and annotates the most uncertain examples that are closest to it. Through a novel reduction to one-dimensional active learning, our algorithm DIRECT is able to leverage the classic active learning literature to address issues such as batch labeling and tolerance towards label noise. We present extensive experiments on imbalanced datasets with and without label noise. Our results demonstrate that DIRECT can save more than 60% of the annotation budget compared to state-of-the-art active learning algorithms and more than 80% of the annotation budget compared to random sampling.
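The core selection step described above, once examples are reduced to one-dimensional scores, can be sketched as follows. This is an illustrative simplification, not the authors' implementation: the score values and the threshold here are hypothetical stand-ins for the deep-embedding projection and the robustly estimated separation threshold from the paper.

```python
import numpy as np

def select_near_threshold(scores, threshold, batch_size):
    """Pick the unlabeled examples whose one-dimensional scores lie
    closest to the estimated class-separation threshold, i.e. the
    most uncertain examples near the decision boundary."""
    dist = np.abs(scores - threshold)          # distance to the threshold
    return np.argsort(dist)[:batch_size]       # indices of the closest ones

# Toy scores: imagine each unlabeled example projected to a scalar,
# e.g. a model's predicted minority-class probability.
scores = np.array([0.05, 0.48, 0.52, 0.90, 0.55, 0.10])
picked = select_near_threshold(scores, threshold=0.5, batch_size=2)
```

Examples far from the threshold (confidently majority or minority) are skipped, which is how the budget concentrates on the informative boundary region.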
Problem

Research questions and friction points this paper is trying to address.

Class imbalance in real-world data degrades minority-class performance
Active learning can collect a more balanced, informative labeled set
Label noise complicates annotation and drives up labeling costs
Innovation

Methods, ideas, or system contributions that make the work stand out.

DIRECT robustly estimates the class-separation threshold via reduction to one-dimensional active learning
Cuts annotation budget by over 60% versus state-of-the-art baselines
Remains robust to label noise and supports batch labeling