AI Summary
Existing calibration measures are not truthful on finite samples: they incentivize predictors to "lie" (i.e., output something other than the ground-truth probabilities) in order to appear more calibrated, and no perfectly truthful measure was known even in the batch setting. Method: We propose averaged two-bin calibration error (ATB), a perfectly truthful calibration measure for the batch setting that is also sound, complete, and continuous. Its expectation is minimized when the predicted probabilities equal the ground-truth probabilities, so a predictor gains nothing by misreporting. ATB is quadratically related to the smooth calibration error (smCal) and the (lower) distance to calibration (distCal), and the general recipe behind its truthfulness proof yields other truthful measures such as quantile-binned ℓ₂-ECE. Contribution/Results: ATB admits estimation algorithms that are faster and significantly easier to implement than those for smCal and distCal, improving running time and simplicity for the calibration testing problem studied by Hu et al. (2024).
Abstract
Calibration requires that predictions are conditionally unbiased and, therefore, reliably interpretable as probabilities. Calibration measures quantify how far a predictor is from perfect calibration. As introduced by Haghtalab et al. (2024), a calibration measure is truthful if it is minimized in expectation when a predictor outputs the ground-truth probabilities. Although predicting the true probabilities guarantees perfect calibration, in reality, when calibration is evaluated on a finite sample, predicting the truth is not guaranteed to minimize any known calibration measure. All known calibration measures incentivize predictors to lie in order to appear more calibrated on a finite sample. Such lack of truthfulness motivated Haghtalab et al. (2024) and Qiao and Zhao (2025) to construct approximately truthful calibration measures in the sequential prediction setting, but no perfectly truthful calibration measure was known to exist even in the more basic batch setting.
We design a perfectly truthful calibration measure in the batch setting: averaged two-bin calibration error (ATB). In addition to being truthful, ATB is sound, complete, continuous, and quadratically related to two existing calibration measures: the smooth calibration error (smCal) and the (lower) distance to calibration (distCal). The simplicity of our definition of ATB makes it efficient and straightforward to compute. ATB allows faster estimation algorithms with significantly easier implementations than smCal and distCal, achieving improved running time and simplicity for the calibration testing problem studied by Hu et al. (2024). We also introduce a general recipe for constructing truthful measures, which proves the truthfulness of ATB as a special case and allows us to construct other truthful calibration measures such as quantile-binned ℓ₂-ECE.
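The finite-sample incentive to lie described in the abstract can be illustrated with a small simulation. The sketch below is not taken from the paper and does not implement ATB. It uses a plain, unbinned expected calibration error (group points by predicted value, then average the absolute gap between the prediction and the empirical label frequency) as a stand-in for a non-truthful measure. The function names `ece` and `simulate`, the sample size, and the ground-truth probabilities 0.4 and 0.6 are all assumptions chosen for illustration.

```python
import random


def ece(preds, labels):
    """Unbinned ECE: group points by predicted value, then take the
    sample-weighted mean of |prediction - empirical label frequency|."""
    groups = {}
    for p, y in zip(preds, labels):
        groups.setdefault(p, []).append(y)
    n = len(preds)
    return sum(len(ys) / n * abs(p - sum(ys) / len(ys)) for p, ys in groups.items())


def simulate(n=20, trials=20000, seed=0):
    """Monte Carlo estimate of expected ECE for two predictors on samples of size n.

    Assumed setup (illustrative only): half of the points have ground-truth
    probability 0.4 and half have 0.6. The truthful predictor reports those
    values; the pooled predictor reports 0.5 for every point.
    """
    rng = random.Random(seed)
    true_probs = [0.4] * (n // 2) + [0.6] * (n // 2)
    pooled = [0.5] * n
    truthful_total = 0.0
    pooled_total = 0.0
    for _ in range(trials):
        # Draw one finite sample of labels from the ground-truth probabilities.
        labels = [1 if rng.random() < p else 0 for p in true_probs]
        truthful_total += ece(true_probs, labels)
        pooled_total += ece(pooled, labels)
    return truthful_total / trials, pooled_total / trials


truthful_ece, pooled_ece = simulate()
print(f"expected ECE, truthful predictor: {truthful_ece:.4f}")
print(f"expected ECE, pooled predictor:   {pooled_ece:.4f}")
```

Under these assumptions the pooled predictor, which does not report the ground-truth probabilities, obtains a lower expected score than the truthful one, because pooling all points into a single group averages away more sampling noise. That is exactly the kind of incentive a truthful measure is meant to remove.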