🤖 AI Summary
Existing uncertainty quantification (UQ) methods are predominantly evaluated in single-answer settings, neglecting data uncertainty—such as ambiguity in world knowledge or inherent stochasticity in mathematical reasoning—thereby limiting their real-world reliability. To address this gap, we introduce MAQA, the first benchmark tailored for multi-answer scenarios, and propose the first data-uncertainty-aware UQ evaluation framework. Leveraging multi-domain, multi-answer annotations and systematically evaluating five white-box and black-box UQ methods—including entropy, confidence scoring, self-consistency, log-probability, and Monte Carlo Dropout—we find that entropy- and self-consistency-based methods exhibit superior robustness under mixed uncertainty; white-box methods are overconfident by more than 37% on mathematical and commonsense reasoning tasks; and UQ performance is highly task-dependent. This work is the first to empirically reveal and quantify how data uncertainty fundamentally impacts trustworthiness assessment of large language models, establishing a new benchmark and methodological foundation for trustworthy AI.
📝 Abstract
Although large language models (LLMs) are capable of performing various tasks, they still produce plausible but incorrect responses. To improve the reliability of LLMs, recent research has focused on uncertainty quantification to predict whether a response is correct. However, most uncertainty quantification methods have been evaluated on questions requiring a single clear answer, ignoring data uncertainty, which arises from irreducible randomness. Instead, these methods consider only model uncertainty, which arises from a lack of knowledge. In this paper, we investigate previous uncertainty quantification methods under the presence of data uncertainty. Our contributions are two-fold: 1) proposing a new Multi-Answer Question Answering dataset, MAQA, consisting of world knowledge, mathematical reasoning, and commonsense reasoning tasks, to evaluate uncertainty quantification with respect to data uncertainty, and 2) assessing 5 uncertainty quantification methods across diverse white- and black-box LLMs. Our findings show that entropy- and consistency-based methods estimate model uncertainty well even under data uncertainty, while other methods for white- and black-box LLMs struggle depending on the task. Additionally, methods designed for white-box LLMs suffer from overconfidence on reasoning tasks compared to simple knowledge queries. We believe our observations will pave the way for future work on uncertainty quantification in realistic settings.
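To make the entropy- and consistency-based methods discussed above concrete, here is a minimal sketch (not the paper's exact implementation) of how uncertainty can be estimated from repeated samples of an LLM's answers: entropy over the empirical answer distribution measures uncertainty, while self-consistency measures agreement with the majority answer. The `samples` list stands in for hypothetical outputs of repeated model calls at nonzero temperature.

```python
from collections import Counter
import math

def answer_entropy(sampled_answers):
    """Shannon entropy (in nats) over the empirical distribution of
    sampled answers. Higher entropy -> higher estimated uncertainty."""
    counts = Counter(sampled_answers)
    total = len(sampled_answers)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def self_consistency(sampled_answers):
    """Fraction of samples agreeing with the majority answer.
    Higher agreement -> higher estimated confidence."""
    counts = Counter(sampled_answers)
    return counts.most_common(1)[0][1] / len(sampled_answers)

# Hypothetical answers sampled from an LLM for the same question.
samples = ["Paris", "Paris", "Paris", "Lyon", "Paris"]
print(round(answer_entropy(samples), 2))   # 0.5 (nats)
print(self_consistency(samples))           # 0.8
```

Note that under data uncertainty (a question with several valid answers), disagreement among samples no longer implies model error, which is exactly the failure mode the paper examines.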