🤖 AI Summary
In supervised learning, labels that are missing at evaluation time, particularly under non-ignorable (MNAR) missingness, bias performance estimates; existing approaches often discard incomplete samples without theoretical guarantees. This paper addresses unbiased performance evaluation under MNAR label missingness at evaluation time. We propose an uncertainty-aware framework based on multiple imputation that models the full predictive distribution, not just point estimates, of metrics including ROC-AUC, precision, and recall, while explicitly modeling the MNAR mechanism. We establish that the imputed metric estimators are asymptotically Gaussian, derive finite-sample convergence bounds, and prove robustness under a realistic error model. Empirical validation confirms that the approach accurately captures both the location and the shape of the metric distribution under MNAR, improving the reliability of model evaluation.
📝 Abstract
Missing data in supervised learning is well-studied, but the specific issue of missing labels during model evaluation has been overlooked. The common workaround of discarding samples with missing labels can introduce bias, especially when data is Missing Not At Random (MNAR). We propose a multiple imputation technique for evaluating classifiers using metrics such as precision, recall, and ROC-AUC. This method offers not only point estimates but also a predictive distribution for these quantities when labels are missing. We empirically show that the predictive distribution's location and shape are generally correct, even in the MNAR regime. Moreover, we establish that this distribution is approximately Gaussian and provide finite-sample convergence bounds. Additionally, a robustness proof is presented, confirming the validity of the approximation under a realistic error model.
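To make the abstract's procedure concrete, here is a minimal sketch (not the paper's implementation) of multiple imputation for metric evaluation: missing test labels are drawn repeatedly from a predictive probability, the metric is recomputed for each completed dataset, and the resulting sample approximates the metric's predictive distribution. The MNAR simulation, the use of the classifier's own probabilities as the imputation model, and all parameter values are illustrative assumptions.

```python
# Sketch: multiple imputation of missing evaluation labels to obtain a
# predictive distribution of ROC-AUC (illustrative, not the paper's code).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
p_te = clf.predict_proba(X_te)[:, 1]  # scores on the evaluation set

# Simulate MNAR missingness: the label's own value drives its missingness
# (positives lose their label more often than negatives).
missing = rng.random(len(y_te)) < np.where(y_te == 1, 0.4, 0.1)

# Multiple imputation: draw each missing label from a predictive probability
# (here, for simplicity, the evaluated model's own probabilities; in practice
# a separate imputation model would be used), then recompute the metric.
aucs = []
for _ in range(200):
    y_imp = y_te.copy()
    y_imp[missing] = (rng.random(missing.sum()) < p_te[missing]).astype(int)
    aucs.append(roc_auc_score(y_imp, p_te))

aucs = np.asarray(aucs)
print(f"ROC-AUC point estimate: {aucs.mean():.3f} +/- {aucs.std():.3f}")
```

The spread of `aucs` across imputations is what the paper's predictive distribution captures: its mean serves as the point estimate, and its shape (approximately Gaussian, per the abstract) quantifies the uncertainty induced by the missing labels.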