Beyond Overconfidence: Foundation Models Redefine Calibration in Deep Neural Networks

📅 2025-06-11
🤖 AI Summary
This study systematically uncovers counterintuitive uncertainty calibration behavior in foundation vision models (ConvNeXt, EVA, BEiT): they are consistently underconfident on in-distribution data yet surprisingly well calibrated on out-of-distribution (OOD) inputs, challenging the prevailing assumption that stronger models inherently calibrate better. The study further identifies a non-monotonic relationship between model capability and calibration performance, and shows that post-hoc calibration can fail, or even degrade confidence reliability, under severe distribution shift.

Method: An empirical evaluation framework grounded in Expected Calibration Error (ECE) and the Brier Score, applying temperature scaling, vector scaling, and TS-Dirichlet post-hoc calibration across diverse architectures and datasets.

Contribution/Results: On ImageNet, in-distribution ECE increases with model scale, while OOD ECE decreases substantially. Post-hoc calibration reduces in-distribution ECE by over 60%, yet under strong distribution shift the calibration error rebounds by up to 2.3×, exposing critical limitations of standard calibration techniques.
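The summary names the two evaluation metrics without defining them; as a reference for readers, here is a minimal NumPy sketch of top-label ECE and the multiclass Brier Score. The equal-width 15-bin scheme is a common default and an assumption on our part, not a detail confirmed by this paper.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Top-label ECE: weighted average gap between accuracy and mean
    confidence over equal-width confidence bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

def brier_score(probs, labels):
    """Multiclass Brier score: mean squared distance between the
    predicted distribution and the one-hot encoded label."""
    onehot = np.eye(probs.shape[1])[labels]
    return np.mean(np.sum((probs - onehot) ** 2, axis=1))

# Usage: probs is an [N, K] array of softmax outputs, labels an [N]
# integer array of ground-truth classes.
# confidences = probs.max(axis=1)
# correct = probs.argmax(axis=1) == labels
# ece = expected_calibration_error(confidences, correct)
# brier = brier_score(probs, labels)
```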

📝 Abstract
Reliable uncertainty calibration is essential for safely deploying deep neural networks in high-stakes applications. Such networks are known to exhibit systematic overconfidence, especially under distribution shift. Although foundation models such as ConvNeXt, EVA, and BEiT have demonstrated significant improvements in predictive performance, their calibration properties remain underexplored. This paper presents a comprehensive investigation into the calibration behavior of foundation models, revealing insights that challenge established paradigms. Our empirical analysis shows that these models tend to be underconfident on in-distribution data, resulting in higher calibration error, while demonstrating improved calibration under distribution shift. Furthermore, we show that foundation models are highly responsive to post-hoc calibration in the in-distribution setting, enabling practitioners to effectively mitigate underconfidence bias. However, these methods become progressively less reliable under severe distribution shift and can occasionally produce counterproductive results. Our findings highlight the complex, non-monotonic effects of architectural and training innovations on calibration, challenging established narratives of continuous improvement.
Problem

Research questions and friction points this paper is trying to address.

Investigates the uncertainty calibration behavior of vision foundation models
Examines underconfidence on in-distribution data alongside improved calibration under distribution shift
Assesses the effectiveness of post-hoc calibration under distribution shift
Innovation

Methods, ideas, or system contributions that make the work stand out.

Foundation models reduce overconfidence under distribution shift
Post-hoc calibration mitigates in-distribution underconfidence (see the temperature scaling sketch after this list)
Architectural and training innovations affect calibration non-monotonically
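To make the post-hoc calibration point concrete, below is a minimal sketch of temperature scaling, the simplest of the three methods the study evaluates (vector scaling and TS-Dirichlet extend it with per-class parameters). The optimizer and its search bounds here are illustrative assumptions, not choices taken from the paper.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_temperature(val_logits, val_labels):
    """Fit a single scalar T > 0 on held-out validation logits by
    minimizing the NLL of softmax(logits / T). For an underconfident
    model the fitted T typically comes out below 1, sharpening the
    softmax rather than smoothing it."""
    def nll(t):
        z = val_logits / t
        z = z - z.max(axis=1, keepdims=True)  # numerical stability
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(val_labels)), val_labels].mean()
    result = minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded")
    return result.x

def apply_temperature(logits, t):
    """Calibrated probabilities: softmax(logits / t)."""
    z = logits / t
    z = z - z.max(axis=1, keepdims=True)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum(axis=1, keepdims=True)
```

A single T fitted on a held-out split leaves the predicted class unchanged while rescaling confidence, which is why it can repair in-distribution underconfidence so cheaply; the study's caveat is that a T fitted in-distribution can stop matching the model's confidence profile, and even hurt, once the test distribution shifts severely.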
👥 Authors
Achim Hekler · German Cancer Research Center (DKFZ), Heidelberg, Germany
Lukas Kuhn · Researcher, DKFZ · Machine Learning, Neuroscience
Florian Buettner · Frankfurt University / DKFZ