🤖 AI Summary
Clinical deployment of machine learning models often relies on traditional metrics (e.g., AUC, accuracy) while neglecting critical clinical requirements: predictive calibration, robustness to distributional shift, and asymmetric misclassification costs. To address this gap, we propose a clinically grounded evaluation framework. Methodologically, it unifies Schervish's characterization of proper scoring rules with clinical domain knowledge to formulate a cost-weighted, class-balance-averaged, calibration-sensitive cross-entropy objective. The framework also integrates explicit calibration assessment, label distribution shift modeling, and asymmetric error cost modeling. Empirically, it preserves discriminative performance while substantially improving decision reliability under real-world conditions. Our approach jointly optimizes calibration, distributional robustness, and cost sensitivity, yielding an interpretable, deployable evaluation paradigm tailored for clinical AI systems.
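For readers unfamiliar with the Schervish characterization invoked above, a brief sketch (standard in the proper-scoring-rules literature, not a result specific to this work): any proper scoring rule for binary outcomes can be written as a mixture of elementary cost-weighted misclassification losses over decision thresholds $c \in (0,1)$,

$$
L(p, y) \;=\; \int_0^1 \ell_c(p, y)\,\omega(c)\,dc,
\qquad
\ell_c(p, y) \;=\; (1-c)\, y\, \mathbf{1}[p \le c] \;+\; c\,(1-y)\, \mathbf{1}[p > c],
$$

where $p$ is the predicted probability, $y \in \{0,1\}$ the outcome, and $\omega(c) \ge 0$ a weight function over thresholds. The log score (cross-entropy) corresponds to $\omega(c) = \tfrac{1}{c(1-c)}$: substituting and integrating recovers $L(p,1) = -\log p$ and $L(p,0) = -\log(1-p)$. Reweighting $\omega$ toward clinically relevant thresholds is what yields a cost-weighted variant of cross-entropy.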
📝 Abstract
Machine learning-based decision support systems are increasingly deployed in clinical settings, where probabilistic scoring functions are used to inform and prioritize patient management decisions. However, widely used scoring rules, such as accuracy and AUC-ROC, fail to adequately reflect key clinical priorities, including calibration, robustness to distributional shifts, and sensitivity to asymmetric error costs. In this work, we propose a principled yet practical evaluation framework for selecting calibrated thresholded classifiers that explicitly accounts for the uncertainty in class prevalences and domain-specific cost asymmetries often found in clinical settings. Building on the theory of proper scoring rules, particularly the Schervish representation, we derive an adjusted variant of cross-entropy (log score) that averages cost-weighted performance over clinically relevant ranges of class balance. The resulting evaluation is simple to apply, sensitive to clinical deployment conditions, and designed to prioritize models that are both calibrated and robust to real-world variations.
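To make the adjusted evaluation concrete, here is a minimal sketch of a cost-weighted log score averaged over a clinically relevant prevalence range, as the abstract describes. All names, cost values, and the prevalence interval below are illustrative assumptions, not the paper's actual implementation; the reweighting step simulates label distribution shift by importance-weighting test samples toward each target prevalence.

```python
import numpy as np

def cost_weighted_log_score(y_true, p_pred, cost_fn=5.0, cost_fp=1.0,
                            prev_range=(0.05, 0.20), n_prev=16, eps=1e-12):
    """Illustrative sketch (not the paper's exact objective):
    cross-entropy with asymmetric error costs, averaged over a
    clinically plausible range of class prevalences.

    cost_fn / cost_fp : relative costs of missing a positive case
                        vs. falsely flagging a negative one.
    prev_range        : assumed interval of deployment prevalences.
    """
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    emp_prev = y.mean()  # empirical positive rate in the test set

    scores = []
    for pi in np.linspace(*prev_range, n_prev):
        # Importance weights so the evaluated class balance matches
        # the target prevalence pi rather than the test set's balance.
        w = np.where(y == 1, pi / emp_prev, (1 - pi) / (1 - emp_prev))
        # Asymmetric log loss: false negatives are penalized more.
        loss = -(cost_fn * y * np.log(p)
                 + cost_fp * (1 - y) * np.log(1 - p))
        scores.append(np.average(loss, weights=w))
    # Lower is better; average over the prevalence range.
    return float(np.mean(scores))
```

Under this kind of score, a model that is well calibrated across the assumed prevalence range is preferred over one whose probabilities are merely well ranked, which is the behavior the framework is designed to reward.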