🤖 AI Summary
Current uncertainty quantification (UQ) methods for large language models typically yield a single confidence score, which cannot distinguish among distinct sources of uncertainty such as knowledge gaps, output variability, and input ambiguity. This limits system safety and the reliability of human-AI interaction. This work studies how the source of uncertainty affects existing UQ methods, introducing a new dataset that explicitly categorizes uncertainty sources and using controlled experiments to evaluate existing UQ approaches under each condition. The study finds that many methods perform well when uncertainty stems solely from model knowledge limitations, but their performance degrades, or even becomes misleading, when other uncertainty sources are introduced. These findings highlight the need for quantification methods that explicitly account for the source of uncertainty.
📝 Abstract
As Large Language Models (LLMs) are increasingly deployed in real-world applications, reliable uncertainty quantification (UQ) becomes critical for safe and effective use. Most existing UQ approaches for language models aim to produce a single confidence score, for example by estimating the probability that a model's answer is correct. However, uncertainty in natural language tasks arises from multiple distinct sources, including model knowledge gaps, output variability, and input ambiguity, which have different implications for system behavior and user interaction. In this work, we study how the source of uncertainty impacts the behavior and effectiveness of existing UQ methods. To enable controlled analysis, we introduce a new dataset that explicitly categorizes uncertainty sources, allowing systematic evaluation of UQ performance under each condition. Our experiments reveal that while many UQ methods perform well when uncertainty stems solely from model knowledge limitations, their performance degrades or becomes misleading when other sources are introduced. These findings highlight the need for uncertainty-aware methods that explicitly account for the source of uncertainty in large language models.
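As a rough illustration of what evaluating a single confidence score "under each condition" could look like, the sketch below groups examples by an annotated uncertainty source and computes, per group, how well the confidence score separates correct from incorrect answers. Everything here is an assumption made for illustration: the abstract does not name an evaluation metric (AUROC is used only because it is a common choice), the source labels `knowledge_gap` and `input_ambiguity` are hypothetical names, and the records are made-up toy values, not results from the paper.

```python
from collections import defaultdict


def auroc(scores, labels):
    """Probability that a correct answer gets a higher confidence score
    than an incorrect one (ties count as 0.5). Returns None when a group
    lacks either correct or incorrect examples."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auroc_by_source(records):
    """records: iterable of (source, confidence, is_correct) tuples.
    Returns {source: AUROC of the confidence score within that source}."""
    groups = defaultdict(list)
    for source, confidence, is_correct in records:
        groups[source].append((confidence, is_correct))
    return {
        source: auroc([c for c, _ in items], [y for _, y in items])
        for source, items in groups.items()
    }


# Toy, made-up records (hypothetical labels and values, for illustration only).
toy_records = [
    ("knowledge_gap", 0.90, 1),
    ("knowledge_gap", 0.80, 1),
    ("knowledge_gap", 0.20, 0),
    ("knowledge_gap", 0.30, 0),
    ("input_ambiguity", 0.40, 1),
    ("input_ambiguity", 0.85, 1),
    ("input_ambiguity", 0.90, 0),
    ("input_ambiguity", 0.80, 0),
]

per_source = auroc_by_source(toy_records)
for source, value in per_source.items():
    print(source, value)
```

On this toy data the score ranks answers perfectly in the `knowledge_gap` group (AUROC 1.0) and worse than chance in the `input_ambiguity` group (AUROC 0.25). A pooled metric over all examples would hide that gap, which is why reporting per source can be informative.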