🤖 AI Summary
This survey addresses the calibration problem in probabilistic forecasting: how to rigorously define, evaluate, and quantify the discrepancy between predicted probabilities and the true data-generating distribution so as to support reliable downstream decision-making. It organizes recent work around a unifying "indistinguishability" framework, which formalizes calibration as the extent to which the predicted and true distributions can be told apart by a specified class of distinguishers. Under this view, mainstream calibration metrics, including Expected Calibration Error (ECE) and Kernel Calibration Error (KCE), arise as instances of discrimination failure under distinguisher classes of varying capacity. Drawing on statistical hypothesis testing, probability theory, and learning theory, the surveyed work develops computationally tractable estimators of calibration error and establishes theoretical links between calibration error and decision-theoretic risk, revealing the operational limits of existing metrics in real-world decision contexts.
📝 Abstract
Calibration is a classical notion from the forecasting literature which aims to address the question: how should predicted probabilities be interpreted? In a world where we only get to observe (discrete) outcomes, how should we evaluate a predictor that hypothesizes (continuous) probabilities over possible outcomes? The study of calibration has seen a surge of recent interest, given the ubiquity of probabilistic predictions in machine learning. This survey describes recent work on the foundational questions of how to define and measure calibration error, and what these measures mean for downstream decision makers who wish to use the predictions to make decisions. A unifying viewpoint that emerges is that of calibration as a form of indistinguishability, between the world hypothesized by the predictor and the real world (governed by nature or the Bayes optimal predictor). In this view, various calibration measures quantify the extent to which the two worlds can be told apart by certain classes of distinguishers or statistical measures.
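To make one of the calibration measures mentioned above concrete, here is a minimal sketch of the classic binned Expected Calibration Error (ECE) estimator for binary outcomes. The function name, the equal-width binning scheme, and the toy data are illustrative choices, not taken from the surveyed work; many binning variants exist and the choice of bins materially affects the estimate.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE for binary outcomes.

    probs: predicted probabilities of the positive outcome.
    labels: observed outcomes in {0, 1}.
    Partitions [0, 1] into n_bins equal-width bins and returns the
    average, weighted by bin size, of the gap between mean predicted
    probability and empirical frequency within each bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into the last bin
        bins[idx].append((p, y))

    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(p for p, _ in members) / len(members)
        emp_freq = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - emp_freq)
    return ece


# Toy example that is perfectly calibrated at the bin level: within each
# bin, the empirical frequency matches the mean predicted probability.
probs = [0.25, 0.25, 0.25, 0.25, 0.75, 0.75, 0.75, 0.75]
labels = [1, 0, 0, 0, 1, 1, 1, 0]
print(expected_calibration_error(probs, labels))  # → 0.0
```

In the indistinguishability view, this estimator corresponds to a weak distinguisher class: tests that only compare bin-level averages, which is why binned ECE can report zero error for predictors that a richer distinguisher class (e.g. the kernel tests underlying KCE) would tell apart from the true distribution.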