🤖 AI Summary
Existing paradigms for modeling epistemic uncertainty (set-based and distribution-based approaches) are difficult to compare fairly due to differences in semantics, underlying assumptions, and evaluation protocols. This work proposes a controlled experimental framework that enables a direct, like-for-like comparison of the two representation schemes by evaluating them on a shared set of predictive distributions generated by the same neural network, thereby isolating the effect of the representation itself from differences in the underlying predictive model. By constructing both credal sets and posterior parameter distributions from this shared collection and employing three distinct uncertainty measures, the study systematically assesses performance across eight benchmark tasks, six underlying predictive models, and ten independent runs per configuration. The result is a fair, directly comparable evaluation of the two paradigms, revealing their relative strengths and suitable application contexts in selective prediction and out-of-distribution detection.
📝 Abstract
Epistemic uncertainty in neural networks is commonly modeled using two second-order paradigms: distribution-based representations, which rely on posterior parameter distributions, and set-based representations based on credal sets (convex sets of probability distributions). These frameworks are often regarded as fundamentally non-comparable due to differing semantics, assumptions, and evaluation practices, leaving their relative merits unclear. Empirical comparisons are further confounded by variations in the underlying predictive models. To clarify this issue, we present a controlled comparative study enabling principled, like-for-like evaluation of the two paradigms. Both representations are constructed from the same finite collection of predictive distributions generated by a shared neural network, isolating representational effects from predictive accuracy. We evaluate each representation through the lens of three uncertainty measures across eight benchmarks, including selective prediction and out-of-distribution detection, spanning six underlying predictive models and ten independent runs per configuration. Our results show that meaningful comparison between these seemingly non-comparable frameworks is both feasible and informative, providing insights into how second-order representation choices impact practical uncertainty-aware performance.
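For concreteness, here is a minimal sketch (Python/NumPy, not the authors' code) of the shared-representation setup the abstract describes: the same finite collection of predictive distributions from one network is turned into both a distribution-based representation (averaging members and decomposing entropy into aleatoric and epistemic parts) and a set-based one (treating the members as vertices of a credal set). The specific measures shown (mutual information; a vertex-based upper-minus-lower entropy gap) are illustrative assumptions, not necessarily the three measures used in the paper.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) along the last axis."""
    p = np.clip(p, eps, 1.0)
    return -np.sum(p * np.log(p), axis=-1)

def distribution_based_uncertainty(preds):
    """preds: (K, C) array of K predictive distributions over C classes,
    treated as samples from the posterior predictive of one shared network.
    Decomposes total uncertainty into aleatoric and epistemic parts."""
    total = entropy(preds.mean(axis=0))   # entropy of the averaged prediction
    aleatoric = entropy(preds).mean()     # average entropy of the members
    return {"total": total, "aleatoric": aleatoric,
            "epistemic": total - aleatoric}  # mutual information

def credal_uncertainty(preds):
    """Treats the same K distributions as vertices of a credal set (their
    convex hull). The vertex minimum of the (concave) entropy is the exact
    lower entropy; the vertex maximum only approximates the true upper
    entropy over the hull, so this is a crude but illustrative surrogate."""
    ent = entropy(preds)
    return {"upper_entropy": ent.max(), "lower_entropy": ent.min(),
            "epistemic": ent.max() - ent.min()}

# Example: K = 5 ensemble members / posterior samples, C = 3 classes, one input.
rng = np.random.default_rng(0)
preds = rng.dirichlet([2.0, 1.0, 1.0], size=5)
print(distribution_based_uncertainty(preds))
print(credal_uncertainty(preds))
```

In the study itself, the resulting uncertainty scores for each representation are then compared on downstream tasks such as selective prediction and out-of-distribution detection, which is where the like-for-like comparison becomes informative.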