🤖 AI Summary
This study addresses the issue of poor calibration in deep neural networks, which often exhibit overconfidence in erroneous predictions and lack reliable uncertainty estimates. The authors systematically compare two uncertainty quantification approaches—Monte Carlo Dropout (as a Bayesian approximation) and conformal prediction—on Fashion-MNIST, evaluating their reliability and calibration performance on H-CNN variants of VGG16 and GoogLeNet. Results show that while VGG16 achieves higher accuracy, it is more overconfident, whereas GoogLeNet demonstrates better calibration. Conformal prediction yields statistically valid prediction sets, offering rigorous finite-sample coverage guarantees while maintaining high empirical coverage, making it particularly suitable for high-stakes decision-making. The work underscores the necessity of evaluating models beyond accuracy by incorporating uncertainty quality and highlights the practical advantages of conformal prediction in real-world deployment scenarios.
📝 Abstract
Deep neural networks (DNNs) have become integral to a wide range of scientific and practical applications due to their flexibility and strong predictive performance. Despite their accuracy, however, DNNs frequently exhibit poor calibration, often assigning overly confident probabilities to incorrect predictions. This limitation underscores the growing need for integrated mechanisms that provide reliable uncertainty estimation. In this article, we compare two prominent approaches for uncertainty quantification: a Bayesian approximation via Monte Carlo Dropout and the nonparametric Conformal Prediction framework. Both methods are assessed using two convolutional neural network architectures; H-CNN VGG16 and GoogLeNet, trained on the Fashion-MNIST dataset. The empirical results show that although H-CNN VGG16 attains higher predictive accuracy, it tends to exhibit pronounced overconfidence, whereas GoogLeNet yields better-calibrated uncertainty estimates. Conformal Prediction additionally demonstrates consistent validity by producing statistically guaranteed prediction sets, highlighting its practical value in high-stakes decision-making contexts. Overall, the findings emphasize the importance of evaluating model performance beyond accuracy alone and contribute to the development of more reliable and trustworthy deep learning systems.