Adaptive Label Error Detection: A Bayesian Approach to Mislabeled Data Detection

📅 2026-01-15
🤖 AI Summary
Label noise degrades classifier performance, even in expert-curated datasets. This work proposes Adaptive Label Error Detection (ALED), a novel method that extracts intermediate features from a deep convolutional neural network, denoises them, models the reduced manifold of each class with a multivariate Gaussian distribution, and flags mislabeled samples with a likelihood ratio test. On multiple medical imaging datasets, ALED shows markedly higher sensitivity than established label error detection methods without compromising precision. In one reported example, fine-tuning a network on the corrected data reduces test set errors by 33.8%.

📝 Abstract
Machine learning classification systems are susceptible to poor performance when trained with incorrect ground truth labels, even when data is well-curated by expert annotators. As machine learning becomes more widespread, it is increasingly imperative to identify and correct mislabeling to develop more powerful models. In this work, we motivate and describe Adaptive Label Error Detection (ALED), a novel method of detecting mislabeling. ALED extracts an intermediate feature space from a deep convolutional neural network, denoises the features, models the reduced manifold of each class with a multidimensional Gaussian distribution, and performs a simple likelihood ratio test to identify mislabeled samples. We show that ALED has markedly increased sensitivity, without compromising precision, compared to established label error detection methods, on multiple medical imaging datasets. We demonstrate an example where fine-tuning a neural network on corrected data results in a 33.8% decrease in test set errors, providing strong benefits to end users. The ALED detector is deployed in the Python package statlab.
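
The pipeline described in the abstract (denoise features, fit one Gaussian per class, apply a likelihood ratio test) can be illustrated with a minimal NumPy sketch. This is not the statlab implementation or its API. The function names, the use of PCA as the denoising step, the covariance regularization term `reg`, the equal class priors, and the zero threshold on the log ratio are all assumptions made for illustration, and `feats` stands in for features already extracted from a network layer.

```python
import numpy as np

def fit_class_gaussians(feats, labels, n_classes, reg=1e-3):
    """Fit a (mean, covariance) pair per class. `reg` is an assumed ridge term."""
    params = []
    for c in range(n_classes):
        x = feats[labels == c]
        mean = x.mean(axis=0)
        cov = np.cov(x, rowvar=False) + reg * np.eye(x.shape[1])
        params.append((mean, cov))
    return params

def log_gaussian(x, mean, cov):
    """Row-wise log density of a multivariate Gaussian."""
    d = x.shape[1]
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
    return -0.5 * (maha + logdet + d * np.log(2 * np.pi))

def flag_label_errors(feats, labels, n_components=2, threshold=0.0):
    """Flag samples whose best alternative class is more likely than the given label.

    Assumptions: PCA stands in for the paper's denoising step, class priors
    are equal, and `threshold` is a hypothetical cutoff on the log ratio.
    """
    n_classes = int(labels.max()) + 1
    # Project onto the top principal components to obtain reduced features.
    centered = feats - feats.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    reduced = centered @ vt[:n_components].T
    # Class-conditional Gaussians are fitted on the (possibly noisy) labels.
    params = fit_class_gaussians(reduced, labels, n_classes)
    loglik = np.stack([log_gaussian(reduced, m, c) for m, c in params], axis=1)
    idx = np.arange(len(labels))
    given = loglik[idx, labels]
    masked = loglik.copy()
    masked[idx, labels] = -np.inf  # exclude the assigned class
    best_other = masked.max(axis=1)
    log_ratio = best_other - given  # > 0: another class explains the sample better
    return log_ratio > threshold, log_ratio
```

Calling `flag_label_errors(feats, labels)` with an `(n_samples, n_features)` float array and an integer label array returns a boolean mask of suspected label errors together with the log likelihood ratios, which can be sorted to prioritize samples for manual review.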
Problem

Research questions and friction points this paper is trying to address.

label error detection
mislabeled data
classification
data quality
machine learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Label Error Detection
Bayesian approach
feature denoising
Gaussian class modeling
likelihood ratio test
Zan Chaudhry
PhD Student, Harvard University
Physics of Living Systems, Applied Mathematics, Machine Learning, Biomedical Engineering

Noam H. Rotenberg
Post-baccalaureate Research Fellow at the National Library of Medicine, NIH
clinical decision support, machine learning, natural language processing

B. Caffo
Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205 USA

Craig Jones
Johns Hopkins University
medical imaging, neural networks, artificial intelligence

Haris I. Sair
Department of Radiology and Radiological Science, Johns Hopkins University School of Medicine, Baltimore, MD 21287 USA, and the Malone Center for Engineering in Healthcare, The Whiting School of Engineering, Johns Hopkins University, Baltimore, MD 21218 USA