🤖 AI Summary
This paper addresses the challenge of disentangling under-confidence from over-confidence when evaluating the calibration of binary classification models. The authors propose the Entropic Calibration Difference (ECD), a metric that adapts state estimation principles from the target tracking literature to calibration assessment. ECD quantifies directional calibration errors by modeling the divergence between predicted and true distributions via information entropy, and combines binning with an expected-error decomposition to yield interpretable, sign-aware measurements of both under- and over-confidence. Unlike conventional metrics such as the Expected Calibration Error (ECE) and its signed counterpart, the Expected Signed Calibration Error (ESCE), ECD offers an unambiguous semantic interpretation and demonstrates superior sensitivity and directional discrimination across real-world and synthetic experiments. It thereby enables joint characterization of model safety (robustness to miscalibration) and statistical efficiency (calibration fidelity), providing a more principled foundation for trustworthy uncertainty quantification.
📝 Abstract
Understanding the confidence with which a machine learning model classifies an input datum is an important, and perhaps under-investigated, concept. In this paper, we propose a new calibration metric, the Entropic Calibration Difference (ECD). Building on existing research in the field of state estimation, specifically target tracking (TT), we show how ECD may be applied to binary classification machine learning models. We describe the relative importance of under- and over-confidence and how they are not conflated in the TT literature; indeed, our metric distinguishes under- from over-confidence. We consider this important because under-confident algorithms are likely to be 'safer' than over-confident ones, albeit at the expense of being over-cautious and hence statistically inefficient. We demonstrate how this new metric performs on real and simulated data and compare it with other metrics for machine learning model probability calibration, including the Expected Calibration Error (ECE) and its signed counterpart, the Expected Signed Calibration Error (ESCE).
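The ECD itself is defined in the body of the paper, but the baseline metrics it is compared against can be illustrated here. Below is a minimal sketch of binned calibration error for a binary classifier, computing both the standard ECE (which takes the magnitude of each bin's accuracy-minus-confidence gap, losing direction) and a signed variant in the spirit of ESCE (which keeps the sign, so positive values indicate under-confidence and negative values over-confidence). The bin count, bin-edge convention, and sign convention are illustrative assumptions, not the paper's exact definitions.

```python
import numpy as np

def binned_calibration_errors(probs, labels, n_bins=10):
    """Return (ece, signed_ece) for a binary classifier.

    probs:  predicted probability of the positive class, shape (N,)
    labels: true binary labels in {0, 1}, shape (N,)

    NOTE: this is an illustrative sketch; the paper's exact ESCE
    sign and binning conventions may differ.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)

    # Confidence of the predicted class and whether it was correct.
    preds = (probs >= 0.5).astype(float)
    conf = np.where(preds == 1, probs, 1.0 - probs)
    correct = (preds == labels).astype(float)

    # Predicted-class confidences lie in [0.5, 1.0] for binary problems.
    edges = np.linspace(0.5, 1.0, n_bins + 1)
    n = len(probs)
    ece, signed_ece = 0.0, 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        # First bin is closed on the left so conf == 0.5 is included.
        in_bin = ((conf >= lo) if i == 0 else (conf > lo)) & (conf <= hi)
        if not in_bin.any():
            continue
        # gap > 0: accuracy exceeds confidence (under-confident);
        # gap < 0: confidence exceeds accuracy (over-confident).
        gap = correct[in_bin].mean() - conf[in_bin].mean()
        weight = in_bin.sum() / n
        ece += weight * abs(gap)      # magnitude only: direction is lost
        signed_ece += weight * gap    # direction kept, but bins can cancel
    return ece, signed_ece
```

The signed variant illustrates the conflation problem the paper targets: opposite-signed gaps in different bins can cancel, whereas ECE discards direction entirely, so neither cleanly separates under- from over-confidence.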