🤖 AI Summary
Large language models (LLMs) exhibit strong generative capabilities but frequently produce outputs containing bias, hallucinations, and factual inaccuracies, which makes reliable uncertainty estimation critical. To address this need, we present a systematic survey of uncertainty estimation for LLMs and propose a unified taxonomy of four method classes: sampling-based, logits/probability-based, reasoning-process-based, and external-calibration-based. We establish a reproducible benchmarking protocol and evaluate these methods empirically across 12 state-of-the-art LLMs and 6 diverse datasets. Our results delineate where each method class is effective at detecting hallucinations, biases, and factual errors. By providing a comprehensive, dedicated survey of LLM uncertainty estimation, this work offers both a methodological foundation for trustworthy AI and concrete directions for future research.
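To make two of the four method classes concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the survey's benchmarking protocol: the function names `sampling_entropy` and `mean_token_nll` are hypothetical, answers are grouped by simple case-insensitive string matching rather than semantic equivalence, and token log-probabilities are assumed to be available from whatever model API is in use.

```python
import math
from collections import Counter


def sampling_entropy(answers):
    """Sampling-based score: entropy (in nats) of the empirical
    distribution over answers sampled for the same prompt.
    Assumption: answers match if equal after strip/lowercase."""
    counts = Counter(a.strip().lower() for a in answers)
    total = sum(counts.values())
    return sum(-(c / total) * math.log(c / total) for c in counts.values())


def mean_token_nll(token_logprobs):
    """Logits/probability-based score: average negative log-probability
    of the generated tokens (higher means less confident)."""
    return -sum(token_logprobs) / len(token_logprobs)


# Hypothetical inputs: five sampled answers and per-token log-probs.
samples = ["Paris", "paris", "Lyon", "Paris", "Marseille"]
print(sampling_entropy(samples))
print(mean_token_nll([-0.1, -0.3, -2.2]))
```

Both scores grow as the model becomes less certain: the first when repeated samples disagree, the second when individual tokens receive low probability. Reasoning-process-based and external-calibration-based methods need model-specific machinery and are not sketched here.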
📝 Abstract
Large language models (LLMs) have demonstrated remarkable capabilities across various tasks. However, these models can produce biased, hallucinated, or non-factual responses that are camouflaged by their fluency and realistic appearance. Uncertainty estimation is a key approach to addressing this challenge. Although research on uncertainty estimation is ramping up, comprehensive surveys dedicated to LLM uncertainty estimation are still lacking. This survey presents four major avenues of LLM uncertainty estimation. Furthermore, we perform extensive experimental evaluations across multiple methods and datasets. Finally, we outline critical and promising future directions for LLM uncertainty estimation.