🤖 AI Summary
Existing calibration metrics for multiclass classification, such as the Expected Calibration Error (ECE), are non-robust and non-truthful: their values depend heavily on the chosen binning strategy, and they give predictors an incentive to report over- or under-confident probabilities in order to appear better calibrated.
Method: We propose the first *truthful* calibration metric for multiclass prediction in the batch setting, obtained by generalizing a truthful binary calibration measure. The metric is provably truthful: it is globally minimized exactly when a predictor reports the true probabilities, so there is no incentive to deviate toward miscalibrated predictions.
Contribution/Results: Through theoretical analysis and empirical evaluation, we show that our metric ranks models consistently across discretization schemes and hyperparameter choices, significantly outperforming ECE and other state-of-the-art measures. Experiments confirm that it reliably reflects the relative calibration of models, establishing it as a robust, interpretable, and theoretically sound benchmark for multiclass uncertainty quantification.
📝 Abstract
Calibrated predictions can be reliably interpreted as probabilities. An important step towards achieving better calibration is to design an appropriate calibration measure to meaningfully assess the miscalibration level of a predictor. A recent line of work initiated by Haghtalab et al. [2024] studies the design of truthful calibration measures: a truthful measure is minimized when a predictor outputs the true probabilities, whereas a non-truthful measure incentivizes the predictor to lie so as to appear more calibrated. All previous calibration measures were non-truthful until Hartline et al. [2025] introduced the first perfectly truthful calibration measures for binary prediction tasks in the batch setting.
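To make the non-truthfulness concrete, consider the standard equal-width binned ECE for binary prediction (a minimal NumPy sketch of the conventional measure, not the metric proposed in this work): an uninformative predictor that always reports the empirical base rate lands entirely in one bin and scores an ECE of (essentially) zero, while the truthful predictor incurs a strictly positive penalty from sampling noise within each bin.

```python
import numpy as np

def binned_ece(pred, y, n_bins=10):
    """Standard equal-width binned ECE for binary prediction: weighted
    average over bins of |empirical frequency - mean prediction|."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        mask = (pred >= lo) & (pred < hi)
        if i == n_bins - 1:           # include the right endpoint in the last bin
            mask |= pred == hi
        if mask.any():
            ece += mask.mean() * abs(y[mask].mean() - pred[mask].mean())
    return ece

rng = np.random.default_rng(0)
p_true = rng.uniform(size=20_000)                      # true probabilities
y = (rng.uniform(size=20_000) < p_true).astype(float)  # sampled outcomes

truthful = binned_ece(p_true, y)                       # reports the truth
base_rate = binned_ece(np.full_like(y, y.mean()), y)   # constant, uninformative
# base_rate is essentially zero while truthful is strictly positive:
# binned ECE rewards the lying predictor over the truthful one.
```

This is the incentive problem that a truthful measure eliminates: under a truthful measure, the truthful predictor is the minimizer, not the base-rate predictor.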
We introduce a perfectly truthful calibration measure for multi-class prediction tasks, generalizing the work of Hartline et al. [2025] beyond binary prediction. We study common methods of extending calibration measures from binary to multi-class prediction and identify which of them preserve truthfulness and which do not. In addition to truthfulness, we mathematically prove and empirically verify that our calibration measure exhibits superior robustness: it robustly preserves the ordering between dominant and dominated predictors, regardless of the choice of hyperparameters (bin sizes). This result addresses the non-robustness issue of binned ECE, which has been observed repeatedly in prior work.
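The binning sensitivity of ECE can also be seen directly. Below is a minimal, illustrative confidence-binned ECE for multiclass prediction (again the conventional measure, on synthetic data, not the measure proposed in this work): the same fixed predictor receives a different miscalibration score for each choice of the number of bins, which is exactly the hyperparameter dependence the robustness result rules out.

```python
import numpy as np

def multiclass_ece(probs, labels, n_bins):
    """Confidence-binned ECE: bin samples by top-class probability and
    average |accuracy - mean confidence| weighted by bin mass."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

# A synthetic, miscalibrated 3-class predictor: softmax of Gaussian
# logits paired with labels drawn independently of the predictions.
rng = np.random.default_rng(1)
logits = rng.normal(size=(5_000, 3))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
labels = rng.integers(0, 3, size=5_000)

for n_bins in (5, 10, 15, 30):
    print(n_bins, multiclass_ece(probs, labels, n_bins))
# the same predictor is assigned a different miscalibration level
# for each choice of n_bins
```

Because the score moves with the bin count, binned ECE can even reorder two models when the bin count changes; a robust measure keeps the ordering between dominant and dominated predictors fixed across such hyperparameter choices.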