🤖 AI Summary
To address the delayed performance monitoring caused by the absence of ground-truth labels post-deployment, this paper proposes the first label-free, multi-metric online estimation framework for binary classification tasks, supporting accuracy, precision, recall, and F1-score: metrics derived from the confusion matrix. Methodologically, it is the first to systematically generalize label-free performance estimation to *any* metric definable via the confusion matrix. By calibrating the model's confidence scores into predictive probability distributions and propagating them as random variables via Monte Carlo inference, the approach enables full probabilistic modeling of the confusion matrix elements, yielding theoretically grounded point estimates and statistically valid confidence intervals. Experiments across multiple benchmark datasets demonstrate that the proposed method significantly outperforms existing baselines, achieving lower estimation errors across all metrics while maintaining nominal coverage rates for its confidence intervals.
📄 Abstract
Model monitoring is a critical component of the machine learning lifecycle, safeguarding against undetected drops in the model's performance after deployment. Traditionally, performance monitoring has required access to ground truth labels, which are not always readily available. This can result in unacceptable latency or render performance monitoring altogether impossible. Recently, methods designed to estimate the accuracy of classifier models without access to labels have shown promising results. However, there are various other metrics that might be more suitable for assessing model performance in many cases. Until now, none of these important metrics has received similar interest from the scientific community. In this work, we address this gap by presenting CBPE, a novel method that can estimate any binary classification metric defined using the confusion matrix. In particular, we choose four metrics from this large family: accuracy, precision, recall, and F$_1$, to demonstrate our method. CBPE treats the elements of the confusion matrix as random variables and leverages calibrated confidence scores of the model to estimate their distributions. The desired metric is then also treated as a random variable, whose full probability distribution can be derived from the estimated confusion matrix. CBPE is shown to produce estimates that come with strong theoretical guarantees and valid confidence intervals.
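The core idea, treating the true label of each unlabeled example as a Bernoulli random variable parameterized by the model's calibrated score, then propagating those variables through the confusion matrix via Monte Carlo sampling, can be illustrated with a short sketch. This is not the paper's reference implementation; the function name `cbpe_estimate`, the decision threshold, the sample count, and the quantile-based confidence intervals are illustrative assumptions, and the input is assumed to be already-calibrated probabilities.

```python
import numpy as np

def cbpe_estimate(probs, threshold=0.5, n_samples=10_000, alpha=0.05, seed=0):
    """Label-free Monte Carlo estimation of confusion-matrix metrics.

    probs: calibrated P(y=1) for each unlabeled production example.
    Returns {metric: (point_estimate, (ci_low, ci_high))} for
    accuracy, precision, recall, and F1 at confidence level 1 - alpha.
    """
    rng = np.random.default_rng(seed)
    probs = np.asarray(probs, dtype=float)
    preds = probs >= threshold  # the model's hard predictions (fixed)

    # Sample hypothetical true labels: each column i is Bernoulli(probs[i]).
    labels = rng.random((n_samples, probs.size)) < probs

    # Confusion-matrix elements as random variables (one draw per row).
    tp = (labels & preds).sum(axis=1).astype(float)
    fp = (~labels & preds).sum(axis=1).astype(float)
    fn = (labels & ~preds).sum(axis=1).astype(float)
    tn = (~labels & ~preds).sum(axis=1).astype(float)

    eps = 1e-12  # guard against division by zero in degenerate draws
    acc = (tp + tn) / probs.size
    prec = tp / (tp + fp + eps)
    rec = tp / (tp + fn + eps)
    f1 = 2 * prec * rec / (prec + rec + eps)

    results = {}
    for name, draws in [("accuracy", acc), ("precision", prec),
                        ("recall", rec), ("f1", f1)]:
        lo, hi = np.quantile(draws, [alpha / 2, 1 - alpha / 2])
        results[name] = (draws.mean(), (lo, hi))
    return results
```

Because each metric is computed per Monte Carlo draw, its full empirical distribution is available, so the same machinery yields point estimates and intervals for any other confusion-matrix-based metric by adding one line to the final loop.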