Unstable Rankings in Bayesian Deep Learning Evaluation

📅 2026-04-24

🤖 AI Summary
This work addresses the unreliability of standard Bayesian deep learning evaluations under data scarcity, where method rankings often exhibit instability and strong dataset dependence. To overcome this, the authors model evaluation metrics as random variables across data realizations and propose an uncertainty-aware assessment framework based on a Bayesian hierarchical model. This framework explicitly estimates method-specific variance and introduces Predictive Minimum Detectable Difference (Predictive MDD) curves to determine whether performance gaps can be reliably detected at a given training scale. Experiments across six Bayesian deep learning methods and five regression datasets demonstrate that existing conclusions drawn from low-data settings are frequently unreliable, whereas the proposed approach enables principled assessment of evaluation adequacy and supports dataset-specific posterior inference.
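To make the idea of a minimum detectable difference concrete, the sketch below uses the classical frequentist formula for a two-sided comparison of two mean metrics. This is a stand-in, not the Predictive MDD defined in the paper, whose exact form is not given on this page. The function name, the default `alpha` and `power`, and every number in `metric_sd_by_train_size` are hypothetical values chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_difference(sd_a, sd_b, n_realizations, alpha=0.05, power=0.8):
    """Classical MDD for the difference of two mean metrics (an assumed stand-in).

    sd_a, sd_b      : per-method standard deviation of the metric across data realizations
    n_realizations  : number of independent data realizations per method
    """
    if n_realizations < 1:
        raise ValueError("n_realizations must be at least 1")
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-sided significance quantile
    z_power = std_normal.inv_cdf(power)          # quantile for the target power
    # Method-specific variances enter separately rather than being pooled.
    standard_error = sqrt((sd_a ** 2 + sd_b ** 2) / n_realizations)
    return (z_alpha + z_power) * standard_error


# Hypothetical metric standard deviations (method A, method B) at each training size.
metric_sd_by_train_size = {50: (0.40, 0.25), 200: (0.20, 0.15), 500: (0.10, 0.08)}

# One MDD value per training size traces out a curve.
mdd_curve = {
    n: minimum_detectable_difference(sd_a, sd_b, n_realizations=20)
    for n, (sd_a, sd_b) in metric_sd_by_train_size.items()
}
```

An observed gap smaller than the curve value at a given training size would, under these assumptions, not be reliably detectable there.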

📝 Abstract
Standard evaluations of Bayesian deep learning methods assume that metric estimates are reliable, but we show this assumption fails under data scarcity. Method rankings are not only unreliable at small $n$, but also dataset-dependent in ways that point estimates cannot reveal: the same method comparison yields $P(\mathrm{MCD} \prec \mathrm{Ensemble}) = 1.000$ at $n = 50$ on one dataset and remains below $0.95$ even at $n = 500$ on another. Across the datasets we consider, no universal sample size threshold exists, which is precisely why dataset-specific posterior inference is necessary. To address this, we use a Bayesian hierarchical model with method-specific variances to treat evaluation metrics as random variables across data realizations, and we use a predictive Minimum Detectable Difference curve to assess whether an observed gap would be detectable at a given training size. Across six Bayesian deep learning methods and five regression datasets, our results show that uncertainty-aware evaluation is necessary in low-data settings, because current evidence for method superiority and predictive detectability at the same training size can diverge substantially. Our framework provides practitioners with principled tools to determine whether their evaluation data is sufficient before drawing conclusions about method superiority.
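A quantity such as $P(\mathrm{MCD} \prec \mathrm{Ensemble})$ can be illustrated with a much simpler model than the hierarchical one the abstract describes. The sketch below treats each method's metric as a sample across data realizations, keeps a separate variance per method, and uses a normal approximation to the posterior of each mean. It assumes that a lower metric is better and that $A \prec B$ means method A has the lower expected metric; both are assumptions made here, as are the function name and the example values.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def prob_a_better(metric_a, metric_b):
    """Approximate P(mean of A < mean of B) from per-realization metric values.

    Assumes lower is better and a normal approximation with plug-in,
    method-specific variances. Each input needs at least two values.
    """
    diff = mean(metric_b) - mean(metric_a)
    # Standard error of the difference in means, with unpooled variances.
    se = sqrt(variance(metric_a) / len(metric_a) + variance(metric_b) / len(metric_b))
    if se == 0.0:
        # Degenerate case: no spread in either sample.
        return 1.0 if diff > 0 else (0.0 if diff < 0 else 0.5)
    return NormalDist().cdf(diff / se)


# Hypothetical metric values from four data realizations per method.
scores_a = [1.0, 1.1, 0.9, 1.0]
scores_b = [2.0, 2.1, 1.9, 2.0]
p = prob_a_better(scores_a, scores_b)
```

With few realizations or large per-method variance, the same point-estimate gap yields a probability well short of a threshold such as 0.95, which is the kind of instability the abstract reports.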
Problem

Research questions and friction points this paper is trying to address.

Bayesian deep learning
evaluation reliability
data scarcity
method ranking
uncertainty-aware evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bayesian hierarchical model
uncertainty-aware evaluation
Minimum Detectable Difference
data scarcity
method ranking instability