🤖 AI Summary
Existing uncertainty quantification methods suffer from strong task specificity and poor cross-task transferability. To address this, we propose a unified framework grounded in strictly proper scoring rules, supporting classification, regression, and generative modeling within a single principled paradigm. Our contributions are threefold: (1) We leverage Bregman divergences to derive a bias–variance decomposition of uncertainty, enabling fine-grained uncertainty analysis; (2) We introduce kernel scoring and kernel spherical scoring—previously unexplored in generative modeling—to define proper calibration error and extend the calibration–sharpness decomposition beyond classification to regression and generation; (3) We demonstrate substantial improvements in uncertainty estimation for large language models, with empirical validation across image, audio, and text generation tasks. The framework yields interpretable evaluation metrics and introduces novel calibration diagnostic tools for generative systems.
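The Bregman-divergence route to a bias–variance decomposition can be illustrated in standard textbook notation (this sketch uses the generic definitions, not the thesis's specific derivation): for a convex generator $G$, the Bregman divergence is

$$
d_G(P, Q) = G(P) - G(Q) - \langle \nabla G(Q),\, P - Q \rangle,
$$

and for the squared error, generated by $G(x) = x^2$, it recovers the familiar decomposition

$$
\mathbb{E}\big[(\hat{y} - y)^2\big] = \underbrace{(\mathbb{E}[\hat{y}] - y)^2}_{\text{bias}^2} + \underbrace{\mathrm{Var}(\hat{y})}_{\text{variance}}.
$$

The thesis extends this idea to functional Bregman divergences, which is what lets the decomposition apply to arbitrary strictly proper scores rather than only squared error.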
📝 Abstract
In this PhD thesis, we propose a novel framework for uncertainty quantification in machine learning based on proper scores. Uncertainty quantification is an important cornerstone for trustworthy and reliable machine learning applications in practice. Usually, approaches to uncertainty quantification are problem-specific, and solutions and insights cannot be readily transferred from one task to another. Proper scores are loss functions minimized by predicting the target distribution. Due to their very general definition, proper scores apply to regression, classification, and even generative modeling tasks. We contribute several theoretical results that connect epistemic uncertainty, aleatoric uncertainty, and model calibration with proper scores, resulting in a general and widely applicable framework. We achieve this by introducing a general bias-variance decomposition for strictly proper scores via functional Bregman divergences. Specifically, we use the kernel score, a kernel-based proper score, for evaluating sample-based generative models in various domains, such as image, audio, and natural language generation. This includes a novel approach for uncertainty estimation of large language models, which outperforms state-of-the-art baselines. Further, we generalize the calibration-sharpness decomposition beyond classification, which motivates the definition of proper calibration errors. We then introduce a novel estimator for proper calibration errors in classification, and a novel risk-based approach to compare different estimators for squared calibration errors. Finally, we offer a decomposition of the kernel spherical score, another kernel-based proper score, allowing a more fine-grained and interpretable evaluation of generative image models.
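To make the kernel score concrete, here is a minimal Monte Carlo sketch of its standard sample-based estimate, $S(P, y) = \tfrac{1}{2}\,\mathbb{E}[k(X, X')] - \mathbb{E}[k(X, y)]$ with $X, X' \sim P$, using a Gaussian kernel. This is a generic illustration under assumed conventions (lower is better; `kernel_score`, the bandwidth, and the unbiased pairwise term are choices made here, not the thesis's implementation):

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    # Gaussian (RBF) kernel, evaluated row-wise with broadcasting.
    diff = x - y
    return np.exp(-np.sum(diff ** 2, axis=-1) / (2 * bandwidth ** 2))

def kernel_score(samples, y, bandwidth=1.0):
    """Estimate S(P, y) = 0.5 * E[k(X, X')] - E[k(X, y)] from model samples.

    samples: array of shape (n, d), draws approximating the model distribution P.
    y: array of shape (d,), the observed outcome.
    Lower scores indicate a better fit of P to y.
    """
    samples = np.asarray(samples, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(samples)
    # Pairwise term, excluding the diagonal for an unbiased estimate.
    diffs = samples[:, None, :] - samples[None, :, :]
    K = np.exp(-np.sum(diffs ** 2, axis=-1) / (2 * bandwidth ** 2))
    pair_term = (K.sum() - np.trace(K)) / (n * (n - 1))
    # Cross term between samples and the observed outcome.
    cross_term = gaussian_kernel(samples, y[None, :], bandwidth).mean()
    return 0.5 * pair_term - cross_term
```

As a sanity check, samples from a standard normal should score an outcome near the mean better (lower) than an outcome far in the tails; the same estimator applies unchanged to high-dimensional samples such as embeddings of generated images or text, which is what makes kernel scores attractive for generative-model evaluation.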