🤖 AI Summary
Existing approaches struggle to evaluate uncertainty-augmented systems holistically in high-stakes decision-making: they score predictions and uncertainty estimates with separate metrics, fix a single rejection cost, or integrate over a coverage-risk curve. This work proposes the ECUASₙ family of metrics, formulated as proper scoring rules for the task of interest. A tunable parameter \( n \) controls the trade-off between the cost of incorrect predictions and the cost of imperfect uncertainty estimates, depending on the needs of the use case. The authors demonstrate the advantages of ECUASₙ both theoretically and empirically, through experiments on diverse classification and generation datasets, including a manually annotated subset of TriviaQA.
📝 Abstract
In high-stakes automated decision-making, access to predictive uncertainty is essential: it lets users (humans or downstream systems) accept or reject predictions based on application-specific cost trade-offs. Uncertainty-augmented (UA) systems, i.e., systems that output both predictions and uncertainty scores, are currently assessed in the literature in a variety of ways: using separate metrics for the predictions and the uncertainty scores, setting a cost function with a fixed rejection cost, or integrating over a coverage-risk curve. We argue that these evaluation approaches are inadequate for assessing the overall performance of a UA system for decision-making under uncertainty, and we propose a novel family of metrics, ECUAS$_n$, formulated as proper scoring rules for the task of interest. The parameter $n$ controls the trade-off between the cost of incorrect predictions and the cost of imperfect uncertainties, depending on the needs of the use case. We demonstrate the advantages of the ECUAS$_n$ metrics both theoretically and empirically, through experiments on diverse classification and generation datasets, including a manually annotated subset of TriviaQA.
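For readers unfamiliar with the two baseline protocols the abstract criticizes, the following is a minimal sketch of a fixed-rejection-cost score and an area under the coverage-risk curve. It does not implement ECUAS$_n$, whose definition the abstract does not give. The function names, the 0/1 error encoding, and the convention that a higher score means more uncertainty are all assumptions made for illustration.

```python
from typing import List, Tuple


def fixed_rejection_cost(errors: List[int], uncertainties: List[float],
                         threshold: float, reject_cost: float) -> float:
    """Average cost when predictions with uncertainty above `threshold`
    are rejected at a fixed cost and accepted predictions pay their 0/1 error.
    (Hypothetical helper; assumes higher score = more uncertain.)"""
    total = 0.0
    for err, unc in zip(errors, uncertainties):
        total += reject_cost if unc > threshold else float(err)
    return total / len(errors)


def risk_coverage_curve(errors: List[int],
                        uncertainties: List[float]) -> List[Tuple[float, float]]:
    """(coverage, risk) pairs obtained by accepting the k least uncertain
    predictions, for k = 1..N. Risk is the error rate among accepted items."""
    order = sorted(range(len(errors)), key=lambda i: uncertainties[i])
    curve = []
    wrong = 0
    for k, i in enumerate(order, start=1):
        wrong += errors[i]
        curve.append((k / len(errors), wrong / k))
    return curve


def area_under_risk_coverage(errors: List[int],
                             uncertainties: List[float]) -> float:
    """Simple average of the risk over all coverage levels (one common
    discretization of the area under the curve; an assumption here)."""
    curve = risk_coverage_curve(errors, uncertainties)
    return sum(risk for _, risk in curve) / len(curve)


# Toy data: 1 marks an incorrect prediction, 0 a correct one.
toy_errors = [0, 0, 1, 0, 1]
toy_uncertainties = [0.1, 0.2, 0.9, 0.3, 0.6]
cost = fixed_rejection_cost(toy_errors, toy_uncertainties,
                            threshold=0.5, reject_cost=0.3)
aurc = area_under_risk_coverage(toy_errors, toy_uncertainties)
```

On this toy data both wrong predictions carry the highest uncertainty, so the fixed-cost score depends only on the chosen threshold and rejection cost, while the curve-based score averages over all thresholds. Neither exposes a tunable trade-off between prediction errors and uncertainty quality, which is the gap the abstract says the parameter $n$ addresses.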