🤖 AI Summary
This work proposes CreDRO, a credal ensemble learning framework based on distributionally robust optimization (DRO) for reliable prediction under distribution shift. Existing approaches to quantifying epistemic uncertainty typically rely on training stochasticity, such as random initialization, and so miss the uncertainty that arises from mismatches between training and test distributions. CreDRO extends the modeling of epistemic uncertainty beyond optimization randomness by training ensemble members under varying relaxations of the i.i.d. assumption, so that their disagreement also reflects potential distribution shifts. In the reported experiments, CreDRO consistently outperforms existing credal methods across multiple benchmarks in out-of-distribution detection and in selective classification for medical applications.
📝 Abstract
Credal predictors are models that are aware of epistemic uncertainty and produce a convex set of probabilistic predictions. They offer a principled way to quantify predictive epistemic uncertainty (EU) and have been shown to improve model robustness in various settings. However, most state-of-the-art methods mainly define EU as disagreement caused by random training initializations, which mostly reflects sensitivity to optimization randomness rather than uncertainty from deeper sources. To address this, we define EU as disagreement among models trained with varying relaxations of the i.i.d. assumption between training and test data. Based on this idea, we propose CreDRO, which learns an ensemble of plausible models through distributionally robust optimization. As a result, CreDRO captures EU not only from training randomness but also from meaningful disagreement due to potential distribution shifts between training and test data. Empirical results show that CreDRO consistently outperforms existing credal methods on tasks such as out-of-distribution detection across multiple benchmarks and selective classification in medical applications.
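To make the two ingredients in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a CVaR-style objective as one common instance of distributionally robust optimization (the abstract does not say which DRO formulation CreDRO uses), and it summarizes a credal set by the per-class lower and upper probabilities over the ensemble members' predictions. The names `cvar_loss`, `credal_envelope`, `epistemic_gap`, and the parameter `alpha` are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cvar_loss(per_sample_losses: Sequence[float], alpha: float) -> float:
    """Average of the worst alpha-fraction of per-sample losses.

    Assumption: CVaR stands in for the DRO objective. alpha = 1.0 recovers
    the ordinary average (the i.i.d. case); a smaller alpha focuses on the
    hardest samples, i.e. a stronger relaxation of the i.i.d. assumption.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    if len(per_sample_losses) == 0:
        raise ValueError("need at least one loss value")
    ordered = sorted(per_sample_losses, reverse=True)
    k = max(1, math.ceil(alpha * len(ordered)))
    return sum(ordered[:k]) / k


def credal_envelope(
    member_probs: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Per-class lower and upper probabilities over the ensemble.

    For single classes, the min and max over the convex hull of the member
    predictions are attained at the member predictions themselves.
    """
    lower = [min(column) for column in zip(*member_probs)]
    upper = [max(column) for column in zip(*member_probs)]
    return lower, upper


def epistemic_gap(member_probs: Sequence[Sequence[float]]) -> float:
    """Largest per-class width of the envelope, used here as a simple EU score."""
    lower, upper = credal_envelope(member_probs)
    return max(u - l for l, u in zip(lower, upper))


# Each hypothetical ensemble member would minimize cvar_loss with its own alpha.
batch_losses = [1.0, 2.0, 3.0, 4.0]
objectives = {alpha: cvar_loss(batch_losses, alpha) for alpha in (1.0, 0.5, 0.25)}

# Predictions of two hypothetical members for one input with three classes.
predictions = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1]]
lower, upper = credal_envelope(predictions)
eu_score = epistemic_gap(predictions)
```

In this sketch, members trained with different `alpha` values see differently reweighted versions of the training data, so their disagreement reflects more than random initialization. An input with a wide envelope (large `eu_score`) would be a candidate for abstention in selective classification or for flagging as out-of-distribution.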