The Illusion of Certainty: Uncertainty quantification for LLMs fails under ambiguity

📅 2025-11-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) face critical challenges in uncertainty quantification (UQ): prevailing UQ methods rely on the assumption that each question has a single unambiguous answer, and they degrade to near-random performance under genuine linguistic ambiguity. Method: We identify the root cause of this limitation and introduce MAQA* and AmbigQA*, the first ambiguity-aware question-answering benchmarks with empirically grounded ground-truth answer distributions. We prove theoretically that mainstream UQ paradigms (namely, prediction-distribution-based and model-ensemble-based approaches) fail fundamentally under ambiguity. To address this, we propose a novel uncertainty annotation framework integrating prediction distributions, internal model representations, and factual co-occurrence statistics. Contribution/Results: Extensive experiments show that all state-of-the-art UQ methods degrade significantly on ambiguous data, exposing a fundamental flaw in current LLM uncertainty estimation. Our work establishes new empirical benchmarks and theoretical foundations for trustworthy LLM research.

📝 Abstract
Accurate uncertainty quantification (UQ) in Large Language Models (LLMs) is critical for trustworthy deployment. While real-world language is inherently ambiguous, reflecting aleatoric uncertainty, existing UQ methods are typically benchmarked against tasks with no ambiguity. In this work, we demonstrate that while current uncertainty estimators perform well under the restrictive assumption of no ambiguity, they degrade to close-to-random performance on ambiguous data. To this end, we introduce MAQA* and AmbigQA*, the first ambiguous question-answering (QA) datasets equipped with ground-truth answer distributions estimated from factual co-occurrence. We find this performance deterioration to be consistent across different estimation paradigms: using the predictive distribution itself, internal representations throughout the model, and an ensemble of models. We show that this phenomenon can be theoretically explained, revealing that predictive-distribution and ensemble-based estimators are fundamentally limited under ambiguity. Overall, our study reveals a key shortcoming of current UQ methods for LLMs and motivates a rethinking of current modeling paradigms.
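To make the abstract's core failure mode concrete, here is a toy sketch (not the paper's benchmark code; the probabilities, question examples, and function name are invented for illustration) of a predictive-distribution-based estimator. It scores uncertainty via the Shannon entropy of the model's answer distribution, and shows why that score cannot separate a well-calibrated model answering a genuinely ambiguous question from a clueless model answering an unambiguous one:

```python
import math

def predictive_entropy(probs):
    """Shannon entropy (in nats) of a model's answer distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Unambiguous question, confident and correct model:
# "What is the capital of France?" -> almost all mass on "Paris".
confident = [0.97, 0.02, 0.01]

# Ambiguous question with two equally valid readings:
# e.g. "Who won the Super Bowl?" (which year?).
# A well-calibrated model should split its mass accordingly.
ambiguous_correct = [0.5, 0.5]

# A genuinely clueless model on an unambiguous question:
clueless = [0.5, 0.5]

print(predictive_entropy(confident))          # low entropy
print(predictive_entropy(ambiguous_correct))  # high entropy
print(predictive_entropy(clueless))           # identical high entropy
```

The last two calls return exactly the same value, so thresholding entropy flags the calibrated model and the clueless one alike: aleatoric uncertainty from ambiguity is indistinguishable from model error in this score.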
Problem

Research questions and friction points this paper is trying to address.

Current UQ methods fail under ambiguous language conditions
Existing benchmarks lack realistic ambiguity in LLM evaluation
Predictive and ensemble estimators show fundamental limitations with ambiguity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces ambiguous QA datasets MAQA* and AmbigQA*
Benchmarks uncertainty estimators across multiple paradigms
Reveals theoretical limitations of predictive-distribution estimators
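Ensemble-based estimators face an analogous limit. As a hedged sketch (using the standard mutual-information decomposition of ensemble disagreement, which is a common instance of this paradigm rather than the paper's exact estimator; the member distributions below are made up), disagreement is high only when members commit to different answers. If every member correctly models the ambiguity with the same split distribution, disagreement collapses to zero and the ambiguity leaves no trace in the score:

```python
import math

def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def mean_dist(dists):
    """Average the member distributions component-wise."""
    n = len(dists)
    return [sum(d[i] for d in dists) / n for i in range(len(dists[0]))]

def disagreement(member_dists):
    """Mutual information: H(mean prediction) - mean member entropy."""
    avg_member_entropy = sum(entropy(d) for d in member_dists) / len(member_dists)
    return entropy(mean_dist(member_dists)) - avg_member_entropy

# Members disagree, each confident in a different answer:
disagreeing = [[0.9, 0.1], [0.1, 0.9]]

# Members all agree the question is ambiguous (identical 50/50 splits):
agree_ambiguous = [[0.5, 0.5], [0.5, 0.5]]

print(disagreement(disagreeing))      # > 0: flagged as uncertain
print(disagreement(agree_ambiguous))  # 0: ambiguity invisible to the score
```

Because the score is zero both for a confident correct ensemble and for an ensemble facing an ambiguous question, it cannot signal ambiguity-driven aleatoric uncertainty, which is consistent with the limitation the bullets above describe.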
Tim Tomov
School of Computation, Information and Technology & Munich Data Science Institute, Technical University of Munich
Dominik Fuchsgruber
School of Computation, Information and Technology & Munich Data Science Institute, Technical University of Munich
Tom Wollschläger
PhD Student in Machine Learning, TUM
machine learning, graph neural networks, quantum machine learning, uncertainty and robustness
Stephan Günnemann
School of Computation, Information and Technology & Munich Data Science Institute, Technical University of Munich