🤖 AI Summary
This work addresses two limitations of the standard Expected Calibration Error (ECE): it weights calibration offsets equally at every confidence level, so it can miss overconfidence risk at high confidence, and it does not evaluate whether confidence scores discriminate between correct and incorrect predictions. The authors propose the Calibrated Size Ratio (CSR), a calibration metric that equals 1 under perfect calibration, and derive from it the risk probability $P_{\mathrm{risk}}$, which quantifies the statistical evidence for overconfidence. They also extend confidence weighting to the standard classification metrics, yielding measures such as cwA and cwAUC that assess the discriminative value of confidence estimates. Theoretical analysis and experiments on synthetic confidence distributions and 15 real datasets, with and without post-hoc calibration, show that CSR achieves near-perfect sensitivity and specificity across all tested conditions.
📝 Abstract
Confidence calibration has been dominated by the Expected Calibration Error (ECE), a linear metric that counts calibration offset equally regardless of the confidence level at which it occurs. We show that ECE can remain small even under arbitrarily large overconfidence risk. We therefore propose the Calibrated Size Ratio (CSR), an interpretable metric that equals 1 under perfect calibration, and derive from it the risk probability $P_{\mathrm{risk}}$, which quantifies the statistical evidence for overconfidence. We further argue that overconfidence risk assessment must be complemented by a measure of discriminative value: whether the assigned confidences actually distinguish correct from incorrect predictions. We show that confidence-weighted accuracy $\mathrm{cwA}$ is the natural such complement, and that confidence weighting extends to all standard classification metrics. In particular, we prove that the confidence-weighted AUC (cwAUC) captures calibration information, whereas the classical AUC cannot. We validate the proposed indicators on several synthetic confidence distributions under multiple controlled calibration profiles and on fifteen real datasets with and without post-hoc calibration. The experiments show that CSR achieves near-perfect sensitivity and specificity across all tested conditions.
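Two of the abstract's claims about the classical metrics can be illustrated with a small sketch. It uses only the standard binned ECE and the rank-based (Mann-Whitney) AUC; the proposed CSR, $P_{\mathrm{risk}}$, cwA, and cwAUC are not defined in the abstract, so they are not implemented here. The toy data, the function names, and the "error ratio" (observed error rate divided by the error rate the confidence implies) are assumptions made for illustration only and are not the paper's definitions.

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Standard binned ECE: sum over bins of bin weight * |accuracy - mean confidence|."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Equal-width bins on [0, 1]; digitize maps each confidence to a bin 0..n_bins-1.
    inner_edges = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    bin_idx = np.digitize(conf, inner_edges, right=True)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_idx == b
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap  # in_bin.mean() is the fraction of samples in the bin
    return ece

def rank_auc(conf, correct):
    """Classical AUC of confidence vs. correctness via pairwise comparisons (ties count 0.5)."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    pos, neg = conf[correct], conf[~correct]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (pos.size * neg.size)

# Toy case 1 (hypothetical data): the same 0.05 offset at low and at high confidence.
conf_low, correct_low = np.full(100, 0.55), np.array([1] * 50 + [0] * 50)
conf_high, correct_high = np.full(100, 0.99), np.array([1] * 94 + [0] * 6)

ece_low = expected_calibration_error(conf_low, correct_low)     # about 0.05
ece_high = expected_calibration_error(conf_high, correct_high)  # about 0.05

# Illustrative error ratio (an assumption, not CSR): observed errors / implied errors.
ratio_low = (1 - correct_low.mean()) / (1 - conf_low.mean())     # about 1.11
ratio_high = (1 - correct_high.mean()) / (1 - conf_high.mean())  # about 6

# Toy case 2 (hypothetical data): a strictly increasing transform that pushes
# confidences towards 1 changes calibration but leaves the ranking, and so the AUC, unchanged.
conf = np.array([0.6, 0.7, 0.8, 0.9])
correct = np.array([0, 1, 0, 1])
auc_raw = rank_auc(conf, correct)           # 0.75
auc_sharp = rank_auc(conf ** 0.1, correct)  # 0.75, same ranking

print(ece_low, ece_high, ratio_low, ratio_high, auc_raw, auc_sharp)
```

In the first toy case ECE assigns the same value to both settings, although at confidence 0.99 the model makes about six times as many errors as its confidence implies. In the second, the classical AUC is identical before and after the transform because it depends only on the ordering of the confidences.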