🤖 AI Summary
This paper investigates the calibration robustness of image classification models under real-world distribution shift. We systematically evaluate training-time calibration methods (e.g., entropy regularization, label smoothing) and post-hoc techniques (e.g., temperature scaling) on eight classification tasks spanning several imaging domains. Key findings: (1) combining entropy regularization with label smoothing yields the best-calibrated raw probabilities under dataset shift; (2) exposing post-hoc calibrators to a small amount of task-unrelated out-of-distribution data makes them the most robust across domains; (3) models fine-tuned from pretrained foundation models are consistently better calibrated than models trained from scratch; and (4) applying post-hoc calibration before ensembling, rather than after, is the more effective ordering under shift. The experiments also reveal an inherent trade-off between in-distribution and out-of-distribution calibration, and show that simple, well-tuned post-hoc methods often match or exceed recently proposed, more complex approaches. Collectively, this work offers a reproducible, practical guide for deploying reliable vision systems under distribution shift.
📝 Abstract
We conduct an extensive study of the state of calibration under real-world dataset shift for image classification. Our work provides important insights into the choice of post-hoc and in-training calibration techniques, and yields practical guidelines for practitioners interested in robust calibration under shift. We compare various post-hoc calibration methods, and their interactions with common in-training calibration strategies (e.g., label smoothing), across a wide range of natural shifts, on eight different classification tasks spanning several imaging domains. We find that: (i) simultaneously applying entropy regularisation and label smoothing yields the best-calibrated raw probabilities under dataset shift, (ii) post-hoc calibrators exposed to a small amount of semantic out-of-distribution data (unrelated to the task) are the most robust under shift, (iii) recent calibration methods aimed specifically at improving calibration under shift do not necessarily offer significant improvements over simpler post-hoc calibration methods, and (iv) improving calibration under shift often comes at the cost of worsening in-distribution calibration. Importantly, these findings hold both for randomly initialised classifiers and for those finetuned from foundation models, the latter being consistently better calibrated than models trained from scratch. Finally, we conduct an in-depth analysis of ensembling effects, finding that (i) applying calibration before ensembling (rather than after) is more effective for calibration under shift, (ii) for ensembles, OOD exposure worsens the trade-off between in-distribution and shifted calibration, and (iii) ensembling remains one of the most effective methods for improving calibration robustness and, combined with finetuning from foundation models, yields the best calibration results overall.
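To make the two ingredients above concrete, here is a minimal NumPy sketch of temperature scaling and of the calibrate-then-ensemble ordering the abstract favors. This is illustrative only, not the paper's code: the function names are ours, and the grid search over `T` is a simple stand-in for the usual NLL minimization (e.g., with L-BFGS) on a held-out validation set.

```python
import numpy as np

def softmax(z, axis=-1):
    # numerically stable softmax over the class axis
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def nll(logits, labels, T):
    # negative log-likelihood of the temperature-scaled probabilities
    p = softmax(logits / T)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 91)):
    # temperature scaling: pick the single scalar T that minimizes
    # validation NLL (grid search here for simplicity)
    return min(grid, key=lambda T: nll(logits, labels, T))

def calibrate_then_ensemble(member_logits, member_temps):
    # calibrate each ensemble member with its own fitted temperature
    # FIRST, then average the resulting probability vectors
    probs = [softmax(l / T) for l, T in zip(member_logits, member_temps)]
    return np.mean(probs, axis=0)
```

The reverse ordering (average member probabilities, then fit one temperature on the ensemble) is the "calibration after ensembling" baseline that the abstract reports as less effective under shift.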