AI Summary
Large language models (LLMs) exhibit reliability challenges in high-stakes domains (e.g., healthcare, law), where uncertainty arises from multiple interdependent sources (ambiguous inputs, divergent reasoning paths, parametric stochasticity, and output randomness) and extends well beyond the classical aleatoric/epistemic dichotomy. To address this, we propose the first four-dimensional uncertainty taxonomy for LLMs, categorizing uncertainty along input, reasoning, parameter, and prediction axes, and overcoming key limitations of conventional uncertainty quantification (UQ) in dimensional coverage and computational scalability. We systematically evaluate over twenty UQ methods, including Bayesian approximation, ensemble sampling, logit calibration, attention-based analysis, and consistency verification, across real-world tasks to characterize their applicability boundaries and failure modes. Finally, we introduce a comprehensive evaluation framework balancing interpretability, robustness, and practicality, providing both theoretical foundations and actionable guidelines for deploying trustworthy LLMs in safety-critical applications.
Abstract
Large Language Models (LLMs) excel in text generation, reasoning, and decision-making, enabling their adoption in high-stakes domains such as healthcare, law, and transportation. However, their reliability is a major concern, as they often produce plausible but incorrect responses. Uncertainty quantification (UQ) enhances trustworthiness by estimating confidence in outputs, enabling risk mitigation and selective prediction. Yet traditional UQ methods struggle with LLMs due to computational constraints and decoding inconsistencies. Moreover, LLMs introduce unique uncertainty sources, such as input ambiguity, reasoning path divergence, and decoding stochasticity, that extend beyond classical aleatoric and epistemic uncertainty. To address this, we introduce a new taxonomy that categorizes UQ methods based on computational efficiency and uncertainty dimensions (input, reasoning, parameter, and prediction uncertainty). We evaluate existing techniques, assess their real-world applicability, and identify open challenges, emphasizing the need for scalable, interpretable, and robust UQ approaches to enhance LLM reliability.
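To make the idea of consistency-based UQ and selective prediction concrete, here is a minimal sketch (not from the paper): sample the same prompt several times, measure agreement among the decoded answers, and abstain when agreement falls below a threshold. The function names and the threshold `tau` are illustrative assumptions, not part of the surveyed methods' APIs.

```python
from collections import Counter
import math

def consistency_uncertainty(samples):
    """Estimate uncertainty from agreement among sampled LLM answers.

    High entropy over the empirical answer distribution means the
    stochastic decodings disagree, signalling prediction uncertainty.
    Returns (entropy in bits, majority-vote agreement in [0, 1]).
    """
    counts = Counter(samples)
    n = len(samples)
    probs = [c / n for c in counts.values()]
    entropy = -sum(p * math.log2(p) for p in probs)
    confidence = counts.most_common(1)[0][1] / n
    return entropy, confidence

def selective_predict(samples, tau=0.7):
    """Return the majority answer, or None (abstain) if agreement < tau."""
    _, confidence = consistency_uncertainty(samples)
    answer = Counter(samples).most_common(1)[0][0]
    return answer if confidence >= tau else None

# Hypothetical decoded answers from five stochastic samples of one prompt
print(selective_predict(["A", "A", "A", "A", "B"]))  # agreement 0.8 -> "A"
print(selective_predict(["A", "B", "C", "A", "B"]))  # agreement 0.4 -> None
```

In practice the samples would come from repeated temperature-based decoding, and semantically equivalent answers would be clustered before counting; this toy version only illustrates the agreement-entropy signal.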