🤖 AI Summary
Existing evaluation metrics, such as accuracy, Expected Calibration Error (ECE), and Area Under the Risk-Coverage curve (AURC), fail to capture *local reliability*: they either ignore confidence scores, average over heterogeneous regions (diluting critical details), or under-penalize overconfident misclassifications. To address this, we propose two novel metrics, Confidence-Weighted Selective Accuracy (CWSA) and its enhanced variant CWSA+, the first to jointly satisfy threshold locality, decomposability, and strong penalization of overconfident errors. CWSA/CWSA+ introduce normalized, confidence-weighted scoring within a locally sensitive evaluation framework that integrates calibration analysis with failure-mode diagnosis. Evaluated on MNIST, CIFAR-10, and diverse synthetic models, CWSA/CWSA+ consistently outperform baselines, sensitively detecting fine-grained miscalibration patterns (e.g., overconfidence, underconfidence). The metrics provide a theoretically grounded yet deployment-ready paradigm for risk quantification in safety-critical applications.
📝 Abstract
In modern machine learning systems, confidence scores are increasingly used to drive selective prediction, whereby a model may abstain from making a prediction when it is not confident. Yet conventional metrics such as accuracy, expected calibration error (ECE), and area under the risk-coverage curve (AURC) do not capture the actual reliability of predictions. These metrics either disregard confidence entirely, dilute valuable localized information through averaging, or fail to suitably penalize overconfident misclassifications, which can be particularly detrimental in real-world systems. We introduce two new metrics, Confidence-Weighted Selective Accuracy (CWSA) and its normalized variant CWSA+, that offer a principled and interpretable way to evaluate predictive models under confidence thresholds. Unlike existing methods, our metrics explicitly reward confident accuracy and penalize overconfident mistakes. They are threshold-local, decomposable, and usable in both evaluation and deployment settings where trust and risk must be quantified. Through extensive experiments on both real-world datasets (MNIST, CIFAR-10) and synthetic model variants (calibrated, overconfident, underconfident, random, perfect), we show that CWSA and CWSA+ effectively detect nuanced failure modes and outperform classical metrics in trust-sensitive tests. Our results confirm that CWSA is a sound basis for developing and assessing selective prediction systems for safety-critical domains.
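To make the idea concrete, the sketch below illustrates a threshold-local, confidence-weighted selective score of the kind the abstract describes: predictions below the confidence threshold are abstained on, confident correct predictions add their confidence, and confident mistakes subtract it. The function name `cwsa_sketch` and the specific weighting scheme are assumptions for illustration; the paper's exact CWSA/CWSA+ formulas may differ.

```python
def cwsa_sketch(confidences, correct, threshold=0.8):
    """Illustrative confidence-weighted selective score at one threshold.

    NOTE: a hypothetical sketch of the behavior described in the abstract
    (reward confident accuracy, penalize overconfident mistakes); this is
    not the paper's exact CWSA definition.
    """
    # Selective prediction: keep only the cases the model is confident about.
    accepted = [(c, ok) for c, ok in zip(confidences, correct) if c >= threshold]
    if not accepted:
        return 0.0  # the model abstains on everything at this threshold

    # Correct predictions contribute +confidence, wrong ones -confidence,
    # so the most overconfident errors are penalized most heavily.
    return sum(c if ok else -c for c, ok in accepted) / len(accepted)
```

Because the score is computed only over predictions accepted at the given threshold, it stays local to that operating point instead of averaging away region-specific failures, which is the property the abstract contrasts with ECE and AURC.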