🤖 AI Summary
Deep ensembles commonly employ uniform averaging, ignoring performance disparities among constituent networks and thereby limiting accuracy, calibration, and out-of-distribution (OOD) detection. To address this, we propose soft Dawid-Skene (sDS) aggregation, the first adaptation of the Dawid-Skene framework to deep ensembles operating on soft labels. sDS employs an expectation-maximization (EM) algorithm to implicitly estimate each network's confusion matrix, enabling performance-aware, dynamic weighting without requiring ground-truth labels. Evaluated on CIFAR and ImageNet benchmarks, sDS consistently outperforms simple averaging: it improves classification accuracy, reduces expected calibration error (ECE) by over 30%, and boosts OOD detection AUC by 5–12 percentage points. Crucially, sDS improves accuracy, calibration, and OOD robustness simultaneously, yielding more balanced and reliable ensemble behavior.
📝 Abstract
Ensembling in deep learning improves accuracy and calibration over single networks. The traditional aggregation approach, ensemble averaging, treats all individual networks equally by averaging their outputs. Inspired by crowdsourcing, we propose an aggregation method for deep ensembles, called soft Dawid Skene, that estimates the confusion matrices of ensemble members and weights them according to their inferred performance. Soft Dawid Skene aggregates soft labels, in contrast to the hard labels often used in crowdsourcing. In extensive experiments, we empirically show that soft Dawid Skene outperforms ensemble averaging in accuracy, calibration, and out-of-distribution detection.
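To make the mechanism concrete, below is a minimal sketch of an EM-based soft Dawid-Skene aggregator. It is not the authors' implementation: it assumes the common soft-label extension of Dawid-Skene in which each network's probability vector is treated as fractional counts in the likelihood, and the function name `soft_dawid_skene` and all hyperparameters are illustrative.

```python
# Hypothetical sketch of soft Dawid-Skene EM aggregation over ensemble
# soft labels. Assumes the fractional-counts soft-label extension of the
# classic Dawid-Skene model; not the paper's reference code.
import numpy as np

def soft_dawid_skene(preds, n_iter=50, eps=1e-8):
    """Aggregate ensemble soft labels with EM.

    preds: array of shape (M, N, K) holding M networks' softmax outputs
           over N inputs and K classes.
    Returns an (N, K) posterior over the true class of each input.
    """
    M, N, K = preds.shape
    # Initialize the posterior with the plain ensemble average,
    # i.e., the baseline that EM then refines.
    q = preds.mean(axis=0)                                   # (N, K)
    for _ in range(n_iter):
        # M-step: per-network confusion matrices C[m, j, k] =
        # P(network m outputs class k | true class j), estimated
        # from fractional counts weighted by the current posterior.
        C = np.einsum('nj,mnk->mjk', q, preds) + eps
        C /= C.sum(axis=2, keepdims=True)                    # normalize rows
        pi = q.mean(axis=0) + eps                            # class prior
        # E-step: update the posterior over true classes, treating each
        # soft prediction as fractional counts in the log-likelihood.
        log_q = np.log(pi) + np.einsum('mnk,mjk->nj', preds, np.log(C))
        log_q -= log_q.max(axis=1, keepdims=True)            # stability
        q = np.exp(log_q)
        q /= q.sum(axis=1, keepdims=True)
    return q

if __name__ == "__main__":
    # Toy usage: 5 networks, 100 inputs, 10 classes.
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(5, 100, 10))
    preds = np.exp(logits) / np.exp(logits).sum(axis=2, keepdims=True)
    posterior = soft_dawid_skene(preds)
    print(posterior.shape)                                   # (100, 10)
```

In this formulation no ground-truth labels are needed: networks whose estimated confusion matrices are closer to the identity contribute more sharply to the aggregated posterior, which is one way the performance-aware weighting described above can arise.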