🤖 AI Summary
Existing calibration metrics for multiclass classification, such as the Expected Calibration Error (ECE), are non-robust and non-truthful: their values depend heavily on the chosen binning strategy, and they give predictors an incentive to report over- or under-confident probabilities in order to appear better calibrated.
Method: We propose the first *truthful* calibration metric for multiclass prediction in the batch setting, obtained by generalizing a truthful binary calibration measure. The metric is provably truthful: it is globally minimized exactly when a predictor reports the true probabilities, so there is no incentive to deviate toward miscalibrated predictions.
Contribution/Results: Through theoretical analysis and empirical evaluation, we show that our metric ranks models consistently across discretization schemes and hyperparameter choices, significantly outperforming ECE and other state-of-the-art measures. Experiments confirm that it reliably reflects the relative calibration of models, establishing it as a robust, interpretable, and theoretically sound benchmark for multiclass uncertainty quantification.
📝 Abstract
Calibrated predictions can be reliably interpreted as probabilities. An important step towards achieving better calibration is to design an appropriate calibration measure to meaningfully assess the miscalibration level of a predictor. A recent line of work initiated by Haghtalab et al. [2024] studies the design of truthful calibration measures: a truthful measure is minimized when a predictor outputs the true probabilities, whereas a non-truthful measure incentivizes the predictor to lie so as to appear more calibrated. All previous calibration measures were non-truthful until Hartline et al. [2025] introduced the first perfectly truthful calibration measures for binary prediction tasks in the batch setting.
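To make the non-truthfulness concrete, consider the standard equal-width binned ECE for binary prediction (a minimal NumPy sketch of the conventional measure, not the metric proposed in this work): an uninformative predictor that always reports the empirical base rate lands entirely in one bin and scores an ECE of (essentially) zero, while the truthful predictor incurs a strictly positive penalty from sampling noise within each bin.

```python
import numpy as np

def binned_ece(pred, y, n_bins=10):
    """Standard equal-width binned ECE for binary prediction: weighted
    average over bins of |empirical frequency - mean prediction|."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        mask = (pred >= lo) & (pred < hi)
        if i == n_bins - 1:           # include the right endpoint in the last bin
            mask |= pred == hi
        if mask.any():
            ece += mask.mean() * abs(y[mask].mean() - pred[mask].mean())
    return ece

rng = np.random.default_rng(0)
p_true = rng.uniform(size=20_000)                      # true probabilities
y = (rng.uniform(size=20_000) < p_true).astype(float)  # sampled outcomes

truthful = binned_ece(p_true, y)                       # reports the truth
base_rate = binned_ece(np.full_like(y, y.mean()), y)   # constant, uninformative
# base_rate is essentially zero while truthful is strictly positive:
# binned ECE rewards the lying predictor over the truthful one.
```

This is the incentive problem that a truthful measure eliminates: under a truthful measure, the truthful predictor is the minimizer, not the base-rate predictor.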
We introduce a perfectly truthful calibration measure for multi-class prediction tasks, generalizing the work of Hartline et al. [2025] beyond binary prediction. We study common methods of extending calibration measures from binary to multi-class prediction and identify which of them preserve truthfulness and which do not. In addition to truthfulness, we mathematically prove and empirically verify that our calibration measure exhibits superior robustness: it robustly preserves the ordering between dominant and dominated predictors, regardless of the choice of hyperparameters (bin sizes). This result addresses the non-robustness issue of binned ECE, which has been observed repeatedly in prior work.
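The binning sensitivity of ECE can also be seen directly. Below is a minimal, illustrative confidence-binned ECE for multiclass prediction (again the conventional measure, on synthetic data, not the measure proposed in this work): the same fixed predictor receives a different miscalibration score for each choice of the number of bins, which is exactly the hyperparameter dependence the robustness result rules out.

```python
import numpy as np

def multiclass_ece(probs, labels, n_bins):
    """Confidence-binned ECE: bin samples by top-class probability and
    average |accuracy - mean confidence| weighted by bin mass."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

# A synthetic, miscalibrated 3-class predictor: softmax of Gaussian
# logits paired with labels drawn independently of the predictions.
rng = np.random.default_rng(1)
logits = rng.normal(size=(5_000, 3))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
labels = rng.integers(0, 3, size=5_000)

for n_bins in (5, 10, 15, 30):
    print(n_bins, multiclass_ece(probs, labels, n_bins))
# the same predictor is assigned a different miscalibration level
# for each choice of n_bins
```

Because the score moves with the bin count, binned ECE can even reorder two models when the bin count changes; a robust measure keeps the ordering between dominant and dominated predictors fixed across such hyperparameter choices.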