🤖 AI Summary
Current LLM uncertainty quantification (UQ) research suffers from three systemic problems that limit its usefulness for human-AI collaboration: evaluation benchmarks with low ecological validity, neglect of aleatoric uncertainty, and optimization of metrics that are decoupled from real-world decision utility. Through an analysis of 40 LLM UQ methods, this position paper argues that the field should be reoriented around the human decision-makers UQ is meant to serve. For each of the three issues, the paper identifies flaws in prevailing practice, including an overreliance on calibration-centric metrics, and proposes concrete user-centric practices and research directions. By shifting the focus from model calibration alone to reliability in human-AI decision-making, the work pushes LLM UQ research toward genuine downstream utility in real-world collaborative settings.
📝 Abstract
Large Language Models (LLMs) are increasingly assisting users in the real world, yet their reliability remains a concern. Uncertainty quantification (UQ) has been heralded as a tool to enhance human-LLM collaboration by enabling users to know when to trust LLM predictions. We argue that current practices for uncertainty quantification in LLMs are not optimal for developing useful UQ for human users making decisions in real-world tasks. Through an analysis of 40 LLM UQ methods, we identify three prevalent practices hindering the community's progress toward its goal of benefiting downstream users: 1) evaluating on benchmarks with low ecological validity; 2) considering only epistemic uncertainty; and 3) optimizing metrics that are not necessarily indicative of downstream utility. For each issue, we propose concrete user-centric practices and research directions that LLM UQ researchers should consider. Instead of hill-climbing on unrepresentative tasks using imperfect metrics, we argue that the community should adopt a more human-centered approach to LLM uncertainty quantification.
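To make the third issue concrete, here is a minimal synthetic sketch (not from the paper; the data, the helper functions `expected_calibration_error` and `selective_accuracy`, and the 0.72 trust threshold are all invented for illustration) of how a confidence score can look excellent under a calibration metric such as expected calibration error (ECE) yet be useless for a downstream decision rule like "only act on answers above a confidence threshold":

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Binned ECE: bin-size-weighted |accuracy - mean confidence| per bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return ece

def selective_accuracy(conf, correct, threshold):
    """Utility proxy: accuracy on the answers a user would actually trust."""
    trusted = conf >= threshold
    coverage = trusted.mean()
    return (correct[trusted].mean() if trusted.any() else float("nan")), coverage

rng = np.random.default_rng(0)
n = 10_000
correct = rng.random(n) < 0.7  # the model answers 70% of questions correctly

# Score A: informative but imperfectly calibrated -- shifted up on correct answers.
score_a = np.clip(0.7 + 0.25 * (correct.astype(float) - 0.5)
                  + rng.normal(0.0, 0.1, n), 0.0, 1.0)
# Score B: uninformative -- hovers at 0.7, exactly matching marginal accuracy.
score_b = np.clip(rng.normal(0.7, 0.02, n), 0.0, 1.0)

for name, score in [("A (informative)", score_a), ("B (constant)", score_b)]:
    ece = expected_calibration_error(score, correct)
    acc, cov = selective_accuracy(score, correct, threshold=0.72)
    print(f"score {name}: ECE={ece:.3f}  "
          f"accuracy on trusted answers={acc:.3f} (coverage={cov:.2f})")
```

On this synthetic data, the constant score B achieves near-zero ECE simply because it matches marginal accuracy, while the informative score A, despite a much worse ECE, lets a user reach far higher accuracy on the answers they actually act on. A benchmark that hill-climbs on ECE alone would prefer B, which is exactly the kind of mismatch between optimization metric and downstream utility the paper cautions against.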