🤖 AI Summary
Deep ensembles of Bayesian neural networks (DE-BNNs) are often assumed to inherit the calibration benefits of both components, yet their performance trade-offs have not been examined systematically. Method: a large-scale empirical evaluation of DE-BNNs across a range of datasets, neural network architectures, and BNN approximation methods, assessing in-distribution (ID) accuracy, calibration, and out-of-distribution (OOD) performance. Contribution/Results: once ensembles grow large enough, standard deep ensembles (DEs) consistently outperform DE-BNNs on in-distribution data, challenging the intuition that adding Bayesian inference to an ensemble should further improve ID performance. Sensitivity and ablation studies probe this observation. Moreover, although DE-BNNs do outperform DEs on OOD metrics, this gain comes at the cost of decreased ID performance, an ID-OOD trade-off the evaluation makes explicit. The authors also open-source the large pool of trained models to facilitate further research on this topic.
📝 Abstract
Bayesian Neural Networks (BNNs) often improve model calibration and predictive uncertainty quantification compared to point estimators such as maximum-a-posteriori (MAP). Deep ensembles (DEs) are likewise known to improve calibration, so it is natural to hypothesize that deep ensembles of BNNs (DE-BNNs) should provide even further improvements. In this work, we systematically investigate this across a number of datasets, neural network architectures, and BNN approximation methods and surprisingly find that when the ensembles grow large enough, DEs consistently outperform DE-BNNs on in-distribution data. To shed light on this observation, we conduct several sensitivity and ablation studies. Moreover, we show that even though DE-BNNs outperform DEs on out-of-distribution metrics, this comes at the cost of decreased in-distribution performance. As a final contribution, we open-source the large pool of trained models to facilitate further research on this topic.
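For readers unfamiliar with the ensembling step: both DEs and DE-BNNs combine members by averaging their predictive distributions (an equal-weight mixture). A minimal sketch, assuming softmax classifiers; the function name and shapes are illustrative placeholders, not taken from the paper's released code:

```python
import numpy as np

def ensemble_predict(member_probs):
    """Equal-weight mixture of ensemble members' predictive distributions.

    member_probs: array-like of shape (n_members, n_samples, n_classes),
    where each slice holds one member's softmax outputs. For a DE each
    member is a MAP-trained network; for a DE-BNN each member is itself
    a (approximate) posterior-averaged BNN prediction.
    """
    probs = np.asarray(member_probs, dtype=float)
    return probs.mean(axis=0)  # average over the member axis

# Two toy members disagreeing on a single 3-class input
m1 = np.array([[0.7, 0.2, 0.1]])
m2 = np.array([[0.3, 0.4, 0.3]])
print(ensemble_predict([m1, m2]))  # [[0.5 0.3 0.2]]
```

The averaged distribution is typically less overconfident than any single member, which is the mechanism behind the calibration gains the abstract refers to.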