🤖 AI Summary
This paper tackles a core tension in calibration for binary decision making: existing measures of whether forecasted probabilities match empirical frequencies tend to be either testable or actionable, but not both. The widely used Expected Calibration Error (ECE) provides decision-relevant guarantees but cannot be consistently estimated from finite samples in many practical cases, while the recently proposed Distance from Calibration (dCE) is estimable but lacks the decision-theoretic guarantees needed in high-stakes settings. The paper introduces the Cutoff Calibration Error (CCE), a measure that assesses calibration over intervals of forecasted probabilities defined by cutoffs. CCE is shown to be both testable (admitting consistent finite-sample estimation) and actionable (carrying decision-theoretic validity), and the authors examine its implications for popular post-hoc calibration methods, including isotonic regression and Platt scaling, toward more reliable calibrated probabilities.
📝 Abstract
Forecast probabilities often serve as critical inputs for binary decision making. In such settings, calibration (ensuring forecasted probabilities match empirical frequencies) is essential. Although the common notion of Expected Calibration Error (ECE) provides actionable insights for decision making, it is not testable: it cannot be empirically estimated in many practical cases. Conversely, the recently proposed Distance from Calibration (dCE) is testable but is not actionable, since it lacks the decision-theoretic guarantees needed for high-stakes applications. We introduce Cutoff Calibration Error, a calibration measure that bridges this gap by assessing calibration over intervals of forecasted probabilities. We show that Cutoff Calibration Error is both testable and actionable and examine its implications for popular post-hoc calibration methods, such as isotonic regression and Platt scaling.
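To make the idea of "calibration over intervals of forecasted probabilities" concrete, here is a minimal sketch of an interval-based estimator. This is an illustration only, not the paper's exact definition: it assumes CCE-style miscalibration is measured as the largest average forecast-outcome gap taken over all intervals `[a, b]` on a cutoff grid, with the gap weighted by how many samples fall in the interval. The function name, grid, and weighting are assumptions for this sketch.

```python
def cutoff_calibration_error(probs, labels, grid_size=20):
    """Sketch of an interval-based calibration error.

    For every interval [a, b] whose endpoints lie on an evenly spaced
    cutoff grid, compute the average gap between binary outcomes and
    forecasted probabilities, zeroing out samples whose forecast falls
    outside the interval (so sparsely populated intervals contribute
    little). Return the worst absolute gap over all intervals.
    """
    n = len(probs)
    cutoffs = [k / grid_size for k in range(grid_size + 1)]
    worst = 0.0
    for i, a in enumerate(cutoffs):
        for b in cutoffs[i + 1:]:
            # Sum of (outcome - forecast) over samples inside [a, b],
            # averaged over ALL n samples (interval-mass weighting).
            gap = sum(y - p for p, y in zip(probs, labels) if a <= p <= b) / n
            worst = max(worst, abs(gap))
    return worst
```

For example, forecasts `[0.2, 0.8]` with outcomes `[0, 1]` are perfectly calibrated on the full interval `[0, 1]` (the gaps cancel), yet a sub-interval such as `[0, 0.5]` isolates the over-forecast sample and exposes a nonzero gap, which is precisely the kind of threshold-local miscalibration an interval-based measure can detect while a single global average cannot.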