AI Summary
Large language models (LLMs) exhibit reliability challenges in high-stakes domains (e.g., healthcare, law), where uncertainty arises from multiple interdependent sources (ambiguous inputs, divergent reasoning paths, parametric stochasticity, and output randomness) and extends well beyond the classical aleatoric/epistemic dichotomy. To address this, we propose the first four-dimensional uncertainty taxonomy for LLMs, categorizing uncertainty along input, reasoning, parameter, and prediction axes, and overcoming key limitations of conventional uncertainty quantification (UQ) in dimensional coverage and computational scalability. We systematically evaluate over twenty UQ methods, including Bayesian approximation, ensemble sampling, logit calibration, attention-based analysis, and consistency verification, across real-world tasks to characterize their applicability boundaries and failure modes. Finally, we introduce a comprehensive evaluation framework balancing interpretability, robustness, and practicality, providing both theoretical foundations and actionable guidelines for deploying trustworthy LLMs in safety-critical applications.
Abstract
Large Language Models (LLMs) excel in text generation, reasoning, and decision-making, enabling their adoption in high-stakes domains such as healthcare, law, and transportation. However, their reliability is a major concern, as they often produce plausible but incorrect responses. Uncertainty quantification (UQ) enhances trustworthiness by estimating confidence in outputs, enabling risk mitigation and selective prediction. Yet traditional UQ methods struggle with LLMs due to computational constraints and decoding inconsistencies. Moreover, LLMs introduce unique uncertainty sources, such as input ambiguity, reasoning path divergence, and decoding stochasticity, that extend beyond classical aleatoric and epistemic uncertainty. To address this, we introduce a new taxonomy that categorizes UQ methods based on computational efficiency and uncertainty dimensions (input, reasoning, parameter, and prediction uncertainty). We evaluate existing techniques, assess their real-world applicability, and identify open challenges, emphasizing the need for scalable, interpretable, and robust UQ approaches to enhance LLM reliability.
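To make the idea of consistency-based UQ and selective prediction concrete, here is a minimal sketch (not from the paper): sample the same prompt several times, measure agreement among the decoded answers, and abstain when agreement falls below a threshold. The function names and the threshold `tau` are illustrative assumptions, not part of the surveyed methods' APIs.

```python
from collections import Counter
import math

def consistency_uncertainty(samples):
    """Estimate uncertainty from agreement among sampled LLM answers.

    High entropy over the empirical answer distribution means the
    stochastic decodings disagree, signalling prediction uncertainty.
    Returns (entropy in bits, majority-vote agreement in [0, 1]).
    """
    counts = Counter(samples)
    n = len(samples)
    probs = [c / n for c in counts.values()]
    entropy = -sum(p * math.log2(p) for p in probs)
    confidence = counts.most_common(1)[0][1] / n
    return entropy, confidence

def selective_predict(samples, tau=0.7):
    """Return the majority answer, or None (abstain) if agreement < tau."""
    _, confidence = consistency_uncertainty(samples)
    answer = Counter(samples).most_common(1)[0][0]
    return answer if confidence >= tau else None

# Hypothetical decoded answers from five stochastic samples of one prompt
print(selective_predict(["A", "A", "A", "A", "B"]))  # agreement 0.8 -> "A"
print(selective_predict(["A", "B", "C", "A", "B"]))  # agreement 0.4 -> None
```

In practice the samples would come from repeated temperature-based decoding, and semantically equivalent answers would be clustered before counting; this toy version only illustrates the agreement-entropy signal.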