Revisiting Uncertainty Estimation and Calibration of Large Language Models

📅 2025-05-29
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study addresses the robustness of uncertainty estimation in large language models (LLMs) to enable trustworthy deployment in high-stakes applications. On the MMLU-Pro benchmark, we systematically evaluate 80 open- and closed-source models (0.6B–671B parameters, diverse architectures) using three black-box, single-forward-pass uncertainty methods: token-probability-based (TPU), numerically verbalized (NVU), and linguistically verbalized (LVU). We find, first, that LVU consistently outperforms TPU and NVU in interpretability, calibration, and discrimination. Moreover, reasoning tasks yield more reliable uncertainty estimates than knowledge-intensive ones, and accuracy does not strongly correlate with reliability. We further demonstrate significant effects of model scale, post-training, reasoning capability, and quantization on uncertainty quality. Finally, we propose a multidimensional evaluation framework, offering both theoretical insights and practical guidance for LLM uncertainty modeling and trustworthy deployment.
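The three black-box, single-forward-pass methods named above can be sketched roughly as follows. This is an illustrative reading of the method names only, not the paper's implementation: the phrase-to-score mapping for LVU and the input formats are assumptions.

```python
import math

def tpu_confidence(answer_token_logprob):
    """TPU sketch: confidence is the probability the model assigned
    to its answer token, recovered from the returned log-probability."""
    return math.exp(answer_token_logprob)

def nvu_confidence(model_reply):
    """NVU sketch: the model is prompted to verbalize a number from
    0 to 100; parse it and normalize (clamped) to [0, 1]."""
    value = float(model_reply.strip().rstrip("%"))
    return max(0.0, min(1.0, value / 100.0))

# LVU sketch: the model is prompted to pick a linguistic confidence
# phrase, which is mapped to a numeric score. This scale is a
# hypothetical example, not the paper's.
LVU_SCALE = {
    "almost certain": 0.95,
    "likely": 0.75,
    "chances about even": 0.50,
    "unlikely": 0.25,
    "almost no chance": 0.05,
}

def lvu_confidence(model_reply):
    """Map a verbalized phrase to a score; fall back to 0.5 if unknown."""
    return LVU_SCALE.get(model_reply.strip().lower(), 0.5)
```

All three produce a single confidence in [0, 1] from one forward pass, which is what makes them cheap enough to evaluate across 80 models.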

๐Ÿ“ Abstract
As large language models (LLMs) are increasingly deployed in high-stakes applications, robust uncertainty estimation is essential for ensuring the safe and trustworthy deployment of LLMs. We present the most comprehensive study to date of uncertainty estimation in LLMs, evaluating 80 models spanning open- and closed-source families, dense and Mixture-of-Experts (MoE) architectures, reasoning and non-reasoning modes, quantization variants and parameter scales from 0.6B to 671B. Focusing on three representative black-box single-pass methods, including token probability-based uncertainty (TPU), numerical verbal uncertainty (NVU), and linguistic verbal uncertainty (LVU), we systematically evaluate uncertainty calibration and selective classification using the challenging MMLU-Pro benchmark, which covers both reasoning-intensive and knowledge-based tasks. Our results show that LVU consistently outperforms TPU and NVU, offering stronger calibration and discrimination while being more interpretable. We also find that high accuracy does not imply reliable uncertainty, and that model scale, post-training, reasoning ability and quantization all influence estimation performance. Notably, LLMs exhibit better uncertainty estimates on reasoning tasks than on knowledge-heavy ones, and good calibration does not necessarily translate to effective error ranking. These findings highlight the need for multi-perspective evaluation and position LVU as a practical tool for improving the reliability of LLMs in real-world settings.
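The calibration evaluation described in the abstract is commonly measured with expected calibration error (ECE): bin predictions by confidence and average the gap between accuracy and mean confidence per bin. The sketch below is a minimal, generic ECE implementation for intuition; the paper's exact metrics and binning choices may differ.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Compute ECE for a list of confidences in [0, 1] and 0/1 outcomes.

    Each prediction falls into one of n_bins equal-width confidence bins;
    ECE is the bin-size-weighted average of |accuracy - mean confidence|.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0
        bins[idx].append((conf, hit))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(h for _, h in b) / len(b)
        ece += (len(b) / n) * abs(accuracy - avg_conf)
    return ece
```

A model that says 80% and is right 80% of the time scores an ECE of 0; the abstract's point that good calibration need not imply good error ranking is why the study also evaluates selective classification separately.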
Problem

Research questions and friction points this paper is trying to address.

Evaluating uncertainty estimation methods in large language models
Assessing calibration and performance across diverse model architectures
Identifying reliable uncertainty indicators for real-world LLM deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates 80 LLMs comprehensively
Focuses on three black-box uncertainty methods
LVU outperforms TPU and NVU consistently