🤖 AI Summary
Existing evaluation metrics, such as accuracy, Expected Calibration Error (ECE), and Area Under the Risk-Coverage curve (AURC), fail to capture *local reliability*: they either ignore confidence scores, average over heterogeneous regions (diluting critical details), or under-penalize overconfident misclassifications. To address this, we propose two novel metrics, Confidence-Weighted Selective Accuracy (CWSA) and its enhanced variant CWSA+, the first to jointly satisfy threshold locality, decomposability, and strong penalization of overconfident errors. CWSA/CWSA+ introduce normalized, confidence-weighted scoring within a locally sensitive evaluation framework that integrates calibration analysis with failure-mode diagnosis. Evaluated on MNIST, CIFAR-10, and diverse synthetic models, CWSA/CWSA+ consistently outperform baselines, sensitively detecting fine-grained miscalibration patterns (e.g., overconfidence, underconfidence). The metrics provide a theoretically grounded yet deployment-ready paradigm for risk quantification in safety-critical applications.
📝 Abstract
In modern machine learning systems, confidence scores are increasingly used to drive selective prediction, whereby a model may abstain from making a prediction when it is not confident. Yet conventional metrics such as accuracy, expected calibration error (ECE), and area under the risk-coverage curve (AURC) do not capture the actual reliability of predictions. These metrics either disregard confidence entirely, dilute valuable localized information through averaging, or fail to suitably penalize overconfident misclassifications, which can be particularly detrimental in real-world systems. We introduce two new metrics, Confidence-Weighted Selective Accuracy (CWSA) and its normalized variant CWSA+, that offer a principled and interpretable way to evaluate predictive models under confidence thresholds. Unlike existing methods, our metrics explicitly reward confident accuracy and penalize overconfident mistakes. They are threshold-local, decomposable, and usable in both evaluation and deployment settings where trust and risk must be quantified. Through extensive experiments on both real-world datasets (MNIST, CIFAR-10) and synthetic model variants (calibrated, overconfident, underconfident, random, perfect), we show that CWSA and CWSA+ effectively detect nuanced failure modes and outperform classical metrics in trust-sensitive tests. Our results confirm that CWSA is a sound basis for developing and assessing selective prediction systems for safety-critical domains.
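To make the idea concrete, the sketch below illustrates a threshold-local, confidence-weighted selective score of the kind the abstract describes: predictions below the confidence threshold are abstained on, confident correct predictions add their confidence, and confident mistakes subtract it. The function name `cwsa_sketch` and the specific weighting scheme are assumptions for illustration; the paper's exact CWSA/CWSA+ formulas may differ.

```python
def cwsa_sketch(confidences, correct, threshold=0.8):
    """Illustrative confidence-weighted selective score at one threshold.

    NOTE: a hypothetical sketch of the behavior described in the abstract
    (reward confident accuracy, penalize overconfident mistakes); this is
    not the paper's exact CWSA definition.
    """
    # Selective prediction: keep only the cases the model is confident about.
    accepted = [(c, ok) for c, ok in zip(confidences, correct) if c >= threshold]
    if not accepted:
        return 0.0  # the model abstains on everything at this threshold

    # Correct predictions contribute +confidence, wrong ones -confidence,
    # so the most overconfident errors are penalized most heavily.
    return sum(c if ok else -c for c, ok in accepted) / len(accepted)
```

Because the score is computed only over predictions accepted at the given threshold, it stays local to that operating point instead of averaging away region-specific failures, which is the property the abstract contrasts with ECE and AURC.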