Label-free estimation of clinically relevant performance metrics under distribution shifts

📅 2025-07-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

In real-world deployment of medical image classification models, performance monitoring is difficult because of distribution shift, severe class imbalance, and the absence of ground-truth labels. This paper generalises existing confidence-score-based performance prediction methods so that they estimate the full confusion matrix, rather than accuracy alone, without requiring any target labels. The methods are benchmarked on chest X-ray data under real-world distribution shifts and under simulated covariate and prevalence shifts. The confusion matrix estimates reliably predicted clinically relevant counting metrics (e.g., true positives and false negatives) under distribution shift. However, the simulated shift scenarios also exposed important failure modes of current performance estimation techniques, so the deployment context needs to be well understood before these techniques are used for unsupervised monitoring of AI models in healthcare.

📝 Abstract

Performance monitoring is essential for safe clinical deployment of image classification models. However, because ground-truth labels are typically unavailable in the target dataset, direct assessment of real-world model performance is infeasible. State-of-the-art performance estimation methods address this by leveraging confidence scores to estimate the target accuracy. Despite being a promising direction, the established methods mainly estimate the model's accuracy and are rarely evaluated in a clinical domain, where strong class imbalances and dataset shifts are common. Our contributions are twofold. First, we introduce generalisations of existing performance prediction methods that directly estimate the full confusion matrix. Second, we benchmark their performance on chest x-ray data under real-world distribution shifts as well as simulated covariate and prevalence shifts. The proposed confusion matrix estimation methods reliably predicted clinically relevant counting metrics on medical images under distribution shifts. However, our simulated shift scenarios exposed important failure modes of current performance estimation techniques, calling for a better understanding of real-world deployment contexts when implementing these performance monitoring techniques for postmarket surveillance of medical AI models.
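To make the idea of estimating a confusion matrix from confidence scores concrete, the following is a minimal sketch for a binary classifier. It is an illustration under stated assumptions, not necessarily the formulation used in the paper: it assumes the positive-class scores are well calibrated on the target data, so each score can be read as the probability that the case is truly positive. The function name `estimate_confusion_matrix` and the `threshold` parameter are hypothetical.

```python
def estimate_confusion_matrix(p_pos, threshold=0.5):
    """Estimate expected confusion-matrix counts without labels.

    ASSUMPTION: each value in p_pos is a calibrated probability that the
    case is truly positive. Calibration can break under distribution
    shift, in which case these estimates are biased.
    """
    tp = fp = fn = tn = 0.0
    for p in p_pos:
        if p >= threshold:
            # Predicted positive: correct with probability p.
            tp += p
            fp += 1.0 - p
        else:
            # Predicted negative: correct with probability 1 - p.
            fn += p
            tn += 1.0 - p
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Hypothetical confidence scores for four unlabeled target cases.
scores = [0.9, 0.8, 0.3, 0.1]
cm = estimate_confusion_matrix(scores)
estimated_accuracy = (cm["TP"] + cm["TN"]) / len(scores)
```

In this sketch the four estimated entries always sum to the number of cases, and the implied accuracy equals the mean confidence in the predicted class, which is how an accuracy-only confidence estimate can be extended to all four cells.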

Problem

Research questions and friction points this paper is trying to address.

Estimating clinical performance metrics without labels under distribution shifts

Extending accuracy estimation to full confusion matrix prediction

Evaluating methods on medical images with real and simulated shifts

Innovation

Methods, ideas, or system contributions that make the work stand out.

Estimates full confusion matrix under shifts

Benchmarks on chest x-ray distribution shifts

Predicts clinical metrics reliably on medical images
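As a brief illustration of why the full confusion matrix matters more than accuracy alone under class imbalance, the sketch below derives standard clinical metrics from the four counts. The definitions are the textbook ones; the function name `clinical_metrics` and the example counts are hypothetical and are not results from the paper.

```python
def clinical_metrics(tp, fp, fn, tn):
    """Derive common clinical metrics from (estimated) confusion-matrix counts."""

    def ratio(num, den):
        # Return NaN instead of raising when a denominator is zero.
        return num / den if den > 0 else float("nan")

    total = tp + fp + fn + tn
    return {
        "accuracy": ratio(tp + tn, total),
        "sensitivity": ratio(tp, tp + fn),  # true positive rate
        "specificity": ratio(tn, tn + fp),  # true negative rate
        "ppv": ratio(tp, tp + fp),          # positive predictive value
        "npv": ratio(tn, tn + fn),          # negative predictive value
        "prevalence": ratio(tp + fn, total),
    }


# Hypothetical counts for an imbalanced dataset (10% prevalence).
metrics = clinical_metrics(tp=8, fp=2, fn=2, tn=88)
```

With these made-up counts, accuracy is high while sensitivity is noticeably lower, which is the kind of gap an accuracy-only estimate cannot reveal.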

🔎 Similar Papers

Estimating Model Performance Under Covariate Shift Without Labels