🤖 AI Summary
This study addresses the challenge of efficiently obtaining reliable model uncertainty under resource-constrained, low-latency conditions. It presents a systematic evaluation of BatchEnsemble in terms of accuracy, calibration, and out-of-distribution (OOD) detection. Through comprehensive empirical analyses, including comparisons with deep ensembles, calibration assessments, and controlled studies of functional and parameter-space similarity among ensemble members on MNIST, the work shows that BatchEnsemble members are highly homogeneous and lack diversity. Results show that BatchEnsemble performs comparably to a single model on CIFAR-10, CIFAR-10-C, and SVHN, while its members on MNIST are nearly identical, failing to capture the predictive diversity characteristic of true ensembles. These findings cast doubt on BatchEnsemble's effectiveness as an efficient ensemble method.
📝 Abstract
In resource-constrained and low-latency settings, uncertainty estimates must be obtained efficiently. Deep Ensembles provide robust epistemic uncertainty (EU) but require training multiple full-size models. BatchEnsemble aims to deliver ensemble-like EU at far lower parameter and memory cost by applying learned rank-1 perturbations to a shared base network. We show that BatchEnsemble not only underperforms Deep Ensembles but closely tracks a single-model baseline in terms of accuracy, calibration, and out-of-distribution (OOD) detection on CIFAR-10, CIFAR-10-C, and SVHN. A controlled study on MNIST finds that members are near-identical in function and parameter space, indicating limited capacity to realize distinct predictive modes. Thus, BatchEnsemble behaves more like a single model than a true ensemble.
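To make the rank-1 construction concrete, here is a minimal NumPy sketch of a BatchEnsemble linear layer, following the standard formulation in which each member i owns only two "fast" vectors r_i and s_i and the member's effective weight is the shared matrix W scaled elementwise by the rank-1 factor r_i s_i^T. All names and dimensions here are illustrative, not taken from the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_members = 8, 4, 3

# Shared "slow" weight matrix: one copy, reused by every member.
W = rng.normal(size=(d_in, d_out))

# Per-member rank-1 "fast" vectors, initialized near 1 so each
# member starts close to the shared network.
R = 1.0 + 0.1 * rng.normal(size=(n_members, d_in))   # input-side vectors r_i
S = 1.0 + 0.1 * rng.normal(size=(n_members, d_out))  # output-side vectors s_i

def member_forward(x, i):
    # Computes x @ (W * outer(r_i, s_i)) without ever materializing
    # the full per-member weight matrix: scale the input by r_i,
    # apply the shared W, then scale the output by s_i.
    return ((x * R[i]) @ W) * S[i]

x = rng.normal(size=(d_in,))
outs = np.stack([member_forward(x, i) for i in range(n_members)])
mean_pred = outs.mean(axis=0)  # ensemble average over members
```

The memory saving is the point of the comparison in the abstract: a Deep Ensemble of M members stores M full copies of W, whereas BatchEnsemble stores one W plus M pairs of vectors, so the diversity of the ensemble rests entirely on how different the rank-1 perturbations become during training.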