Aligning Evaluation with Clinical Priorities: Calibration, Label Shift, and Error Costs

📅 2025-06-17
🤖 AI Summary
Clinical deployment of machine learning models often relies on traditional metrics (e.g., AUC, accuracy) while neglecting critical clinical requirements—namely, predictive calibration, robustness to distributional shift, and asymmetric misclassification costs. To address this gap, we propose a clinically grounded evaluation framework. Methodologically, we unify Schervish’s characterization theory with clinical domain knowledge to formulate a cost-weighted, class-balanced, and calibration-sensitive cross-entropy objective derived from proper scoring rules. Concurrently, the framework integrates explicit calibration assessment, label distribution shift modeling, and asymmetric error cost modeling. Empirically, it preserves discriminative performance while substantially improving decision reliability under real-world conditions. Our approach achieves joint optimization of calibration, distributional robustness, and cost sensitivity—yielding an interpretable, deployable evaluation paradigm tailored for clinical AI systems.

📝 Abstract
Machine learning-based decision support systems are increasingly deployed in clinical settings, where probabilistic scoring functions are used to inform and prioritize patient management decisions. However, widely used scoring rules, such as accuracy and AUC-ROC, fail to adequately reflect key clinical priorities, including calibration, robustness to distributional shifts, and sensitivity to asymmetric error costs. In this work, we propose a principled yet practical evaluation framework for selecting calibrated thresholded classifiers that explicitly accounts for the uncertainty in class prevalences and domain-specific cost asymmetries often found in clinical settings. Building on the theory of proper scoring rules, particularly the Schervish representation, we derive an adjusted variant of cross-entropy (log score) that averages cost-weighted performance over clinically relevant ranges of class balance. The resulting evaluation is simple to apply, sensitive to clinical deployment conditions, and designed to prioritize models that are both calibrated and robust to real-world variations.
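The adjusted log score described in the abstract can be made concrete. In the Schervish representation, the log score is a mixture of elementary cost-weighted misclassification losses over decision thresholds c, with mixing density 1/(c(1−c)); restricting that mixture to a clinically relevant threshold range yields an adjusted variant. A minimal sketch, not the paper's implementation — the function names, the rectangle-rule integration, and the default range [0.05, 0.5] are illustrative assumptions:

```python
import numpy as np

def elementary_score(p, y, c):
    # Elementary cost-weighted loss at threshold c (Schervish representation):
    # penalty (1 - c) for predicting below c on a positive case (a miss),
    # penalty c for predicting above c on a negative case (a false alarm).
    miss = (1.0 - c) * ((y == 1) & (p <= c))
    false_alarm = c * ((y == 0) & (p > c))
    return miss + false_alarm

def adjusted_log_score(p, y, c_lo=0.05, c_hi=0.5, n_grid=200):
    """Log score restricted to a clinically relevant threshold range.

    Integrates the elementary losses against the log-loss mixing density
    1 / (c (1 - c)) over [c_lo, c_hi] instead of the full interval (0, 1).
    Lower is better; the full-range limit recovers ordinary cross-entropy.
    """
    cs = np.linspace(c_lo, c_hi, n_grid)
    weights = 1.0 / (cs * (1.0 - cs))
    losses = np.array([elementary_score(p, y, c).mean() for c in cs])
    dc = cs[1] - cs[0]
    return float(np.sum(losses * weights) * dc)  # rectangle-rule quadrature
```

Because the score only aggregates thresholds in [c_lo, c_hi], miscalibration outside the clinically actionable range no longer penalizes a model, which is the intended deployment-sensitive behavior.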
Problem

Research questions and friction points this paper is trying to address.

Align evaluation metrics with clinical priorities
Address calibration and robustness to distribution shifts
Account for asymmetric error costs in clinical settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Calibrated thresholded classifiers selection framework
Adjusted cross-entropy for clinical cost-weighting
Robust evaluation sensitive to class balance variations
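Two ingredients of the framework, label distribution shift and asymmetric error costs, admit simple closed-form adjustments that a sketch can illustrate. Under label shift, a calibrated posterior is recalibrated by rescaling its odds with the prior-odds ratio between training and deployment prevalence, and the Bayes-optimal decision threshold follows directly from the false-positive/false-negative cost ratio. The function names and example numbers below are assumptions for illustration; the paper's exact procedure may differ:

```python
def shift_posterior(p, pi_train, pi_deploy):
    # Label-shift adjustment: rescale posterior odds by the ratio of
    # deployment prior odds to training prior odds, then convert back.
    odds = (p / (1.0 - p)) * (pi_deploy / pi_train) * ((1.0 - pi_train) / (1.0 - pi_deploy))
    return odds / (1.0 + odds)

def cost_optimal_threshold(c_fp, c_fn):
    # Bayes-optimal threshold on a calibrated probability under
    # asymmetric misclassification costs: classify positive when
    # p > c_fp / (c_fp + c_fn).
    return c_fp / (c_fp + c_fn)
```

For example, if a missed diagnosis is judged nine times as costly as a false alarm, `cost_optimal_threshold(1.0, 9.0)` gives a decision threshold of 0.1, and `shift_posterior` maps training-time probabilities to a deployment site whose disease prevalence differs from the training cohort's.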
Gerardo A. Flores
Massachusetts Institute of Technology
Alyssa H. Smith
Northeastern University
Julia A. Fukuyama
Indiana University
Ashia C. Wilson
Assistant Professor at MIT
Machine Learning · Optimization · Dynamical Systems · Statistics