🤖 AI Summary
Large language models (LLMs) exhibit pervasive overconfidence in numerical estimation tasks and struggle to produce well-calibrated confidence intervals. Method: We introduce FermiEval, a systematic benchmark of Fermi-style estimation questions for evaluating how well LLMs quantify uncertainty around their own answers, assessed via two metrics: coverage probability and interval sharpness (Winkler score). To calibrate the models' intervals, the approach combines conformal prediction with direct log-probability elicitation and quantile adjustment. Results: Calibration raises the empirical coverage of nominal 99% confidence intervals from 65% to 99% while reducing Winkler scores by 54%, substantially mitigating overconfidence. A proposed “perception-tunnel” theory explains the bias: when reasoning under uncertainty, LLMs act as if sampling from a truncated region of their inferred distribution, neglecting its tails. This work establishes a reproducible evaluation framework and an effective calibration paradigm for uncertainty modeling in LLMs.
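For reference, the sharpness metric mentioned above is the standard Winkler interval score (the paper's exact scoring conventions may differ slightly): for a nominal central $(1-\alpha)$ interval $[\ell, u]$ and realized value $y$,

$$
W_\alpha(\ell, u, y) \;=\; (u - \ell) \;+\; \frac{2}{\alpha}(\ell - y)\,\mathbf{1}\{y < \ell\} \;+\; \frac{2}{\alpha}(y - u)\,\mathbf{1}\{y > u\},
$$

so wide intervals are penalized linearly and misses are penalized in proportion to how far the truth falls outside the interval; lower is better.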
📝 Abstract
Large language models (LLMs) excel at numerical estimation but struggle to quantify uncertainty correctly. We study how well LLMs construct confidence intervals around their own answers and find that they are systematically overconfident. To evaluate this behavior, we introduce FermiEval, a benchmark of Fermi-style estimation questions with a rigorous scoring rule for confidence-interval coverage and sharpness. Across several modern models, nominal 99% intervals cover the true answer only 65% of the time on average. With an approach based on conformal prediction that adjusts the intervals, we obtain an accurate 99% observed coverage, and the Winkler interval score decreases by 54%. We also propose direct log-probability elicitation and quantile-adjustment methods, which further reduce overconfidence at high confidence levels. Finally, we develop a perception-tunnel theory explaining why LLMs exhibit overconfidence: when reasoning under uncertainty, they act as if sampling from a truncated region of their inferred distribution, neglecting its tails.
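To make the interval-adjustment idea concrete, below is a minimal split-conformal sketch in Python. The function names (`conformal_scale`, `adjust_interval`) and the multiplicative-widening scheme are assumptions for exposition, not the paper's exact procedure: nonconformity scores are computed on a held-out calibration set of questions, a finite-sample quantile is taken, and each model-reported interval is widened by that factor.

```python
import numpy as np

def conformal_scale(cal_lo, cal_hi, cal_true, alpha=0.01):
    """Hypothetical split-conformal calibration of LLM interval widths.

    cal_lo, cal_hi : model-reported lower/upper bounds on calibration questions
    cal_true       : true answers for those questions
    alpha          : 1 - nominal coverage (0.01 for a 99% interval)
    Returns a multiplicative widening factor for the interval half-width.
    """
    cal_lo, cal_hi, cal_true = map(np.asarray, (cal_lo, cal_hi, cal_true))
    mid = 0.5 * (cal_lo + cal_hi)
    half = np.maximum(0.5 * (cal_hi - cal_lo), 1e-12)
    # Nonconformity: distance of the truth from the midpoint, in half-widths.
    scores = np.abs(cal_true - mid) / half
    n = len(scores)
    # Finite-sample conformal quantile level.
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return float(np.quantile(scores, q_level, method="higher"))

def adjust_interval(lo, hi, scale):
    """Widen a single model-reported interval by the calibrated factor."""
    mid, half = 0.5 * (lo + hi), 0.5 * (hi - lo)
    return mid - scale * half, mid + scale * half
```

On new questions, a factor greater than one widens every interval enough that, under the usual exchangeability assumption between calibration and test questions, the adjusted intervals cover the truth with probability at least 1 - alpha.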