🤖 AI Summary
Quantifying uncertainty in large language model (LLM) outputs remains challenging, limiting their real-world reliability. This paper systematically evaluates four uncertainty estimation paradigms—VCE, MSP, sample consistency, and the novel hybrid CoCoA—across four question-answering benchmarks, assessing both calibration (via Expected Calibration Error, ECE) and discrimination (via AUROC). CoCoA integrates multiple confidence signals—including token-level logits, answer consistency across perturbed inputs, and self-evaluated correctness—to jointly improve uncertainty quantification. Empirical results across multiple state-of-the-art open-source LLMs demonstrate that CoCoA achieves statistically significant gains: it improves error detection by +12.3% AUROC and reduces ECE by 37.6% relative to the strongest baselines. CoCoA consistently outperforms existing methods in both calibration and discrimination, establishing a reproducible, practically actionable framework for LLM uncertainty modeling.
📝 Abstract
Large language models (LLMs) produce outputs with varying levels of uncertainty and, just as often, varying levels of correctness, making their practical reliability far from guaranteed. To quantify this uncertainty, we systematically evaluate four approaches for confidence estimation in LLM outputs: VCE, MSP, Sample Consistency, and CoCoA (Vashurin et al., 2025). To evaluate these approaches, we conduct experiments on four question-answering tasks using a state-of-the-art open-source LLM. Our results show that each uncertainty metric captures a different facet of model confidence and that the hybrid CoCoA approach yields the best reliability overall, improving both calibration and discrimination of correct answers. We discuss the trade-offs of each method and provide recommendations for selecting uncertainty measures in LLM applications.
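The two evaluation criteria named above, calibration (ECE) and discrimination (AUROC), can be computed directly from per-question confidence scores and correctness labels. The sketch below is a minimal illustration of both metrics, not the paper's evaluation code; the binning scheme (10 equal-width bins) and the rank-based AUROC formulation are standard choices assumed here.

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    # ECE: bin predictions by confidence, compare mean confidence to
    # empirical accuracy within each bin, and weight by bin size.
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        # include the left edge only for the first bin
        mask = (conf > lo) & (conf <= hi) if i > 0 else (conf >= lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return ece

def auroc(conf, correct):
    # AUROC via the Mann-Whitney U formulation: the probability that a
    # correct answer receives higher confidence than an incorrect one
    # (ties count as one half).
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    pos, neg = conf[correct], conf[~correct]
    if len(pos) == 0 or len(neg) == 0:
        return float("nan")
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

For example, confidences `[0.9, 0.8, 0.3, 0.2]` with correctness `[1, 1, 0, 0]` are perfectly discriminative (AUROC of 1.0) yet imperfectly calibrated, since no bin's mean confidence matches its accuracy exactly.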