Calibration in Deep Learning: A Survey of the State-of-the-Art

📅 2023-08-02
🏛️ arXiv.org
📈 Citations: 54
Influential: 3
🤖 AI Summary
Deep neural networks (DNNs) exhibit strong predictive performance but suffer from poor calibration: their predicted confidence scores poorly reflect true correctness probabilities, hindering trustworthy deployment in safety-critical domains such as healthcare and autonomous driving. This paper presents a comprehensive survey on DNN calibration, formally defining calibration and systematically analyzing the causes of miscalibration. We propose a unified four-category taxonomy of calibration methods: post-hoc correction (e.g., temperature scaling, isotonic regression), regularization-based approaches (e.g., label smoothing), uncertainty-aware modeling (e.g., Monte Carlo Dropout, deep ensembles), and hybrid strategies. We further extend this framework to large language models (LLMs), identifying calibration challenges specific to them, including instruction sensitivity and scale-induced overconfidence, and outlining directions for addressing them. Finally, we introduce a principled evaluation framework that bridges theoretical foundations, methodological design, and empirical assessment, enabling end-to-end calibration analysis and supporting the reliable deployment of trustworthy AI systems.
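Among the post-hoc methods named above, temperature scaling is the simplest: all logits are divided by a single scalar T fitted on a validation set before the softmax. A minimal pure-Python sketch (not from the paper; the function names are illustrative):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def temperature_scale(logits, temperature):
    """Post-hoc calibration: divide logits by a scalar temperature.

    T > 1 softens the distribution (lowers confidence); T < 1 sharpens it.
    The argmax, and hence accuracy, is unchanged.
    """
    return softmax([z / temperature for z in logits])
```

For example, `temperature_scale([2.0, 1.0, 0.0], 2.0)` yields a flatter distribution than `T = 1.0` while preserving the predicted class; in practice T is chosen by minimizing negative log-likelihood on held-out data.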
📝 Abstract
Calibrating deep neural models plays an important role in building reliable, robust AI systems for safety-critical applications. Recent work has shown that modern neural networks with high predictive capability are poorly calibrated and produce unreliable predictions. Though deep learning models achieve remarkable performance on various benchmarks, the study of model calibration and reliability remains relatively underexplored. Ideal deep models should not only have high predictive performance but also be well calibrated. There have been some recent advances in calibrating deep models. In this survey, we review the state-of-the-art calibration methods and the principles behind them. First, we define model calibration and explain the root causes of miscalibration. We then introduce the key metrics used to measure calibration. This is followed by a summary of calibration methods, which we roughly classify into four categories: post-hoc calibration, regularization methods, uncertainty estimation, and composition methods. We also cover recent advances in calibrating large models, particularly large language models (LLMs). Finally, we discuss open issues, challenges, and potential directions.
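The metrics the abstract refers to are typically built around the Expected Calibration Error (ECE): predictions are binned by confidence, and the gap between average accuracy and average confidence is averaged across bins, weighted by bin size. A minimal sketch (not from the paper; equal-width binning is assumed):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins.

    confidences: per-example max predicted probability (floats in (0, 1]).
    correct: per-example 0/1 indicator of whether the prediction was right.
    Returns sum over bins of (bin size / n) * |accuracy - confidence|.
    """
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Bin b covers (lo, hi]; the first bin also includes 0.
        idx = [i for i, c in enumerate(confidences)
               if (c > lo or b == 0) and c <= hi]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(acc - conf)
    return ece
```

A perfectly calibrated model scores 0: e.g., ten predictions at 0.9 confidence of which exactly nine are correct give an ECE of 0, whereas two predictions at 1.0 confidence with only one correct give 0.5.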
Problem

Research questions and friction points this paper is trying to address.

Surveying state-of-the-art calibration methods for deep neural networks
Addressing poor calibration in high-performance yet unreliable AI models
Reviewing calibration metrics, causes of miscalibration, and solution categories
Innovation

Methods, ideas, or system contributions that make the work stand out.

Post-hoc calibration methods
Application of regularization techniques
Uncertainty estimation approaches
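The uncertainty-estimation category includes Monte Carlo Dropout: dropout is kept active at inference, the stochastic forward passes are averaged, and their spread serves as an uncertainty signal. A toy single-neuron sketch (not from the paper; the linear-sigmoid model and all names are illustrative):

```python
import math
import random

def mc_dropout_predict(weights, x, p_drop=0.5, n_samples=100, seed=0):
    """Monte Carlo Dropout on a toy linear-sigmoid model.

    Each sample randomly drops input features with probability p_drop
    (scaling survivors by 1/(1 - p_drop), as in inverted dropout),
    computes sigmoid(w . x), and the mean/variance over samples give
    the calibrated-ish prediction and an uncertainty estimate.
    """
    rng = random.Random(seed)
    outputs = []
    for _ in range(n_samples):
        z = sum(w * xi * (0.0 if rng.random() < p_drop else 1.0 / (1 - p_drop))
                for w, xi in zip(weights, x))
        outputs.append(1.0 / (1.0 + math.exp(-z)))
    mean = sum(outputs) / n_samples
    var = sum((o - mean) ** 2 for o in outputs) / n_samples
    return mean, var
```

Deep ensembles follow the same recipe with independently trained models in place of dropout samples; in both cases high variance flags inputs where the confidence score should not be trusted.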
Cheng Wang
Amazon