A Perfectly Truthful Calibration Measure

📅 2025-08-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing calibration measures are not truthful on finite samples: they reward predictors that “lie” (i.e., output something other than the true probabilities) with a better expected score, and no perfectly truthful measure was previously known even in the batch setting. Method: The paper proposes the averaged two-bin calibration error (ATB), a perfectly truthful calibration measure for batch evaluation that is also sound, complete, and continuous. Its expected value is minimized when the predictor outputs the ground-truth probabilities. ATB is quadratically related to the smooth calibration error (smCal) and the (lower) distance to calibration (distCal), and the general recipe behind its truthfulness proof also yields other truthful measures such as quantile-binned ℓ₂-ECE. Contribution/Results: Because its definition is simple, ATB admits estimation algorithms that are faster and significantly easier to implement than those for smCal and distCal, improving running time and simplicity for the calibration testing problem studied by Hu et al. (2024).

📝 Abstract

Calibration requires that predictions are conditionally unbiased and, therefore, reliably interpretable as probabilities. Calibration measures quantify how far a predictor is from perfect calibration. As introduced by Haghtalab et al. (2024), a calibration measure is truthful if it is minimized in expectation when a predictor outputs the ground-truth probabilities. Although predicting the true probabilities guarantees perfect calibration, in reality, when calibration is evaluated on a finite sample, predicting the truth is not guaranteed to minimize any known calibration measure. All known calibration measures incentivize predictors to lie in order to appear more calibrated on a finite sample. Such lack of truthfulness motivated Haghtalab et al. (2024) and Qiao and Zhao (2025) to construct approximately truthful calibration measures in the sequential prediction setting, but no perfectly truthful calibration measure was known to exist even in the more basic batch setting. We design a perfectly truthful calibration measure in the batch setting: averaged two-bin calibration error (ATB). In addition to being truthful, ATB is sound, complete, continuous, and quadratically related to two existing calibration measures: the smooth calibration error (smCal) and the (lower) distance to calibration (distCal). The simplicity in our definition of ATB makes it efficient and straightforward to compute. ATB allows faster estimation algorithms with significantly easier implementations than smCal and distCal, achieving improved running time and simplicity for the calibration testing problem studied by Hu et al. (2024). We also introduce a general recipe for constructing truthful measures, which proves the truthfulness of ATB as a special case and allows us to construct other truthful calibration measures such as quantile-binned ℓ₂-ECE.
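
The finite-sample incentive to lie can be seen with a small simulation. The sketch below is not the paper's ATB construction (its definition is not given on this page); it only illustrates the problem for one familiar measure, the equal-width binned ECE. The setup is an assumption chosen for illustration: true probabilities drawn uniformly from [0, 1], samples of size 100, 10 bins, and a "lying" predictor that ignores the truth and always outputs 0.5. The names `binned_ece` and `expected_ece` are hypothetical helpers.

```python
import numpy as np


def binned_ece(preds, labels, n_bins=10):
    """Equal-width binned l1 ECE: sum over bins of |sum of (y - p) in the bin| / n."""
    # Map each prediction to a bin index in {0, ..., n_bins - 1}.
    bins = np.minimum((preds * n_bins).astype(int), n_bins - 1)
    # Per-bin sum of residuals y - p.
    resid_sums = np.bincount(bins, weights=labels - preds, minlength=n_bins)
    return np.abs(resid_sums).sum() / len(preds)


def expected_ece(predictor, n=100, trials=2000, seed=0):
    """Monte Carlo estimate of the expected ECE of `predictor` on samples of size n."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(trials):
        # Assumed data model: each point has its own true probability.
        p_true = rng.uniform(0.0, 1.0, size=n)
        y = (rng.uniform(size=n) < p_true).astype(float)
        total += binned_ece(predictor(p_true), y)
    return total / trials


# Truthful predictor: outputs the ground-truth probabilities.
truthful = expected_ece(lambda p: p)
# Lying predictor: always outputs 0.5, regardless of the truth.
lying = expected_ece(lambda p: np.full_like(p, 0.5))

print(f"expected ECE, truthful predictor: {truthful:.4f}")
print(f"expected ECE, constant-0.5 predictor: {lying:.4f}")
```

Under these assumptions the constant predictor gets the lower expected score, because it pools all points into one bin where sampling noise averages out, while the truthful predictor pays for noise separately in every bin. A truthful measure in the sense of Haghtalab et al. (2024) would rule this out: no alternative predictor could beat the ground-truth probabilities in expectation.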

Problem

Research questions and friction points this paper is trying to address.

Designing a perfectly truthful calibration measure in batch settings

Addressing lack of truthfulness in existing calibration measures

Improving efficiency and simplicity in calibration testing algorithms

Innovation

Methods, ideas, or system contributions that make the work stand out.

ATB, a perfectly truthful calibration measure for the batch setting

Efficient computation enabled by a simple definition

General recipe for truthful measures

🔎 Similar Papers

Calibration in Deep Learning: A Survey of the State-of-the-Art