CalArena: A Large-Scale Post-Hoc Calibration Benchmark

📅 2026-05-28

🤖 AI Summary

This study addresses two problems: probability estimates from modern classifiers are often poorly calibrated, and post-hoc calibration methods have lacked a unified, large-scale evaluation framework. The authors build a standardized benchmark of nearly 2,000 experiments across tabular and computer vision tasks, covering binary, multiclass, and large-scale classification settings. The benchmark aggregates predictions from classical models, deep networks, and foundation models, and reimplements dozens of calibration methods within a common evaluation framework. The authors argue that Post-Hoc Improvement (PHI), measured in proper scoring rules, is a principled alternative to traditional calibration error estimators because it captures both calibration quality and any degradation of the model's predictive performance. Key findings: smooth calibration functions consistently outperform binning-based approaches, dedicated multiclass methods are essential in high-dimensional settings, and generic machine learning models are not competitive without calibration-specific design. All data, code, and evaluation tools are publicly released as a plug-and-play benchmark.

📝 Abstract

Reliable probability estimates are critical in many machine learning applications, yet modern classifiers are often poorly calibrated. Post-hoc calibration provides a simple and widely used solution, but the large number of proposed methods, combined with small-scale and inconsistent evaluations, makes it difficult to determine which approaches are truly effective in practice. We introduce a large-scale, standardized benchmark for post-hoc calibration, covering nearly 2000 experiments across tabular and computer vision tasks, including binary, multiclass, and large-scale classification settings. Our benchmark aggregates predictions from a diverse set of classical models, modern deep learning architectures, and foundation models, and provides unified, reproducible implementations of dozens of calibration methods within a common evaluation framework. We argue that Post-Hoc Improvement (PHI) in proper scoring rules offers a principled alternative to traditional calibration error estimators for comparing post-hoc methods, capturing both calibration quality and potential degradation to the model's predictive performance. Using this framework, we conduct the most comprehensive empirical study of post-hoc calibration to date. Our results reveal consistent patterns across domains: smooth calibration functions outperform binning-based approaches, dedicated multiclass methods are essential in high-dimensional settings, and generic machine learning models are not competitive without calibration-specific design. To facilitate future research, we release all data, code, and evaluation tools, providing a plug-and-play benchmark for developing and comparing calibration methods.
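
To make the idea of measuring improvement in a proper scoring rule concrete, here is a minimal sketch in Python. It is not the benchmark's code. The abstract does not give the exact formula for PHI, so `post_hoc_improvement` below assumes the simplest reading: the score before calibration minus the score after, on held-out data. Temperature scaling stands in as one example of a smooth calibration function, and the synthetic data, the function names, and the grid search are all illustrative assumptions.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def log_loss(probs, labels, eps=1e-12):
    """Mean negative log-likelihood of the true class (a proper scoring rule)."""
    true_class = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(np.clip(true_class, eps, 1.0))))


def brier_score(probs, labels):
    """Multiclass Brier score (a proper scoring rule)."""
    onehot = np.eye(probs.shape[1])[labels]
    return float(np.mean(np.sum((probs - onehot) ** 2, axis=1)))


def fit_temperature(logits, labels, grid=None):
    """Pick the temperature that minimizes log loss on a calibration split.

    A plain grid search keeps the sketch dependency-free; it is not
    necessarily how the benchmark fits this calibrator.
    """
    if grid is None:
        grid = np.linspace(0.05, 10.0, 400)
    losses = [log_loss(softmax(logits / t), labels) for t in grid]
    return float(grid[int(np.argmin(losses))])


def post_hoc_improvement(score_before, score_after):
    """ASSUMED definition: reduction in a proper score after calibration.

    Positive means the calibrator helped; negative means it hurt.
    The paper's exact PHI formula may differ (for example, it may be normalized).
    """
    return score_before - score_after


def run_demo(seed=0, n=4000, n_classes=3):
    """Synthetic overconfident classifier, calibrated by temperature scaling."""
    rng = np.random.default_rng(seed)
    true_logits = 1.5 * rng.normal(size=(n, n_classes))
    # Gumbel-max trick: samples labels from softmax(true_logits).
    labels = np.argmax(true_logits + rng.gumbel(size=(n, n_classes)), axis=1)
    model_logits = 3.0 * true_logits  # overconfident by a factor of 3

    cal, test = slice(0, n // 2), slice(n // 2, n)
    temperature = fit_temperature(model_logits[cal], labels[cal])

    before = softmax(model_logits[test])
    after = softmax(model_logits[test] / temperature)
    y = labels[test]
    return {
        "temperature": temperature,
        "phi_log_loss": post_hoc_improvement(log_loss(before, y), log_loss(after, y)),
        "phi_brier": post_hoc_improvement(brier_score(before, y), brier_score(after, y)),
        "accuracy_before": float(np.mean(before.argmax(axis=1) == y)),
        "accuracy_after": float(np.mean(after.argmax(axis=1) == y)),
    }


if __name__ == "__main__":
    for name, value in run_demo().items():
        print(f"{name}: {value:.4f}")
```

Because a proper score penalizes both miscalibration and lost discriminative power, a calibrator that smoothed probabilities at the cost of accuracy would show a smaller or negative improvement under this reading. Temperature scaling leaves the predicted class unchanged, so in this sketch the accuracy is the same before and after.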

Problem

Research questions and friction points this paper is trying to address.

post-hoc calibration

calibration benchmark

probability estimation

model calibration

proper scoring rules

Innovation

Methods, ideas, or system contributions that make the work stand out.

post-hoc calibration

calibration benchmark

proper scoring rules