🤖 AI Summary
Current LLM uncertainty quantification (UQ) research suffers from three systemic problems that limit its usefulness for human-AI collaboration: evaluation benchmarks with low ecological validity, neglect of aleatoric uncertainty, and optimization of metrics that are decoupled from real-world decision utility. Through an analysis of 40 LLM UQ methods, this position paper argues that the field should be reoriented around the human decision-makers UQ is meant to serve. For each of the three issues, the paper identifies flaws in prevailing practice, including an overreliance on calibration-centric metrics, and proposes concrete user-centric practices and research directions. By shifting the focus from model calibration alone to reliability in human-AI decision-making, the work pushes LLM UQ research toward genuine downstream utility in real-world collaborative settings.
📝 Abstract
Large Language Models (LLMs) are increasingly assisting users in the real world, yet their reliability remains a concern. Uncertainty quantification (UQ) has been heralded as a tool to enhance human-LLM collaboration by enabling users to know when to trust LLM predictions. We argue that current practices for uncertainty quantification in LLMs are not optimal for developing useful UQ for human users making decisions in real-world tasks. Through an analysis of 40 LLM UQ methods, we identify three prevalent practices hindering the community's progress toward its goal of benefiting downstream users: 1) evaluating on benchmarks with low ecological validity; 2) considering only epistemic uncertainty; and 3) optimizing metrics that are not necessarily indicative of downstream utility. For each issue, we propose concrete user-centric practices and research directions that LLM UQ researchers should consider. Instead of hill-climbing on unrepresentative tasks using imperfect metrics, we argue that the community should adopt a more human-centered approach to LLM uncertainty quantification.
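To make the third issue concrete, here is a minimal synthetic sketch (not from the paper; the data, the helper functions `expected_calibration_error` and `selective_accuracy`, and the 0.72 trust threshold are all invented for illustration) of how a confidence score can look excellent under a calibration metric such as expected calibration error (ECE) yet be useless for a downstream decision rule like "only act on answers above a confidence threshold":

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Binned ECE: bin-size-weighted |accuracy - mean confidence| per bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return ece

def selective_accuracy(conf, correct, threshold):
    """Utility proxy: accuracy on the answers a user would actually trust."""
    trusted = conf >= threshold
    coverage = trusted.mean()
    return (correct[trusted].mean() if trusted.any() else float("nan")), coverage

rng = np.random.default_rng(0)
n = 10_000
correct = rng.random(n) < 0.7  # the model answers 70% of questions correctly

# Score A: informative but imperfectly calibrated -- shifted up on correct answers.
score_a = np.clip(0.7 + 0.25 * (correct.astype(float) - 0.5)
                  + rng.normal(0.0, 0.1, n), 0.0, 1.0)
# Score B: uninformative -- hovers at 0.7, exactly matching marginal accuracy.
score_b = np.clip(rng.normal(0.7, 0.02, n), 0.0, 1.0)

for name, score in [("A (informative)", score_a), ("B (constant)", score_b)]:
    ece = expected_calibration_error(score, correct)
    acc, cov = selective_accuracy(score, correct, threshold=0.72)
    print(f"score {name}: ECE={ece:.3f}  "
          f"accuracy on trusted answers={acc:.3f} (coverage={cov:.2f})")
```

On this synthetic data, the constant score B achieves near-zero ECE simply because it matches marginal accuracy, while the informative score A, despite a much worse ECE, lets a user reach far higher accuracy on the answers they actually act on. A benchmark that hill-climbs on ECE alone would prefer B, which is exactly the kind of mismatch between optimization metric and downstream utility the paper cautions against.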