🤖 AI Summary
This work addresses uncertainty quantification (UQ) for large language models (LLMs), focusing on determining *when an LLM knows it does not know*, i.e., detecting unreliable predictions. For black-box or near-black-box settings, the authors propose a lightweight framework based on output consistency: it samples multiple generations, measures sequence-level similarity and diversity among them, and aggregates these signals to train a confidence estimator with minimal supervision. The method requires no access to model internals (neither parameters nor gradients) and is compatible with diverse UQ paradigms. It generalizes across complex generative tasks, including question answering, summarization, and text-to-SQL. Empirical evaluation demonstrates consistent, significant improvements over state-of-the-art baselines across multiple benchmarks; the approach also exhibits strong cross-model generalization, well-calibrated confidences, and high computational efficiency.
📝 Abstract
When does a large language model (LLM) know what it does not know? Uncertainty quantification (UQ) provides measures of uncertainty, such as an estimate of the confidence in an LLM's generated output, and is therefore increasingly recognized as a crucial component of trusted AI systems. Black-box UQ methods do not require access to internal model information from the generating LLM and therefore have numerous real-world advantages, such as robustness to system changes, adaptability to choice of LLM, reduced costs, and computational tractability. In this paper, we investigate the effectiveness of UQ techniques that are primarily but not necessarily entirely black-box, where the consistency between a generated output and other sampled generations is used as a proxy for confidence in its correctness. We propose a high-level non-verbalized similarity-based aggregation framework that subsumes a broad swath of UQ approaches suitable for complex generative tasks, and we introduce specific novel techniques from the framework that train confidence estimation models using small training sets. Through an empirical study with datasets spanning the diverse tasks of question answering, summarization, and text-to-SQL, we demonstrate that our proposed similarity-based methods can yield better calibrated confidences than baselines.
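The core black-box idea, scoring a generation by its consistency with other sampled generations, can be sketched in a few lines of Python. This is an illustrative toy, not the paper's method: `consistency_confidence` and `pairwise_similarity` are hypothetical names, and `difflib.SequenceMatcher` stands in for whatever sequence-level similarity measure (e.g., ROUGE or embedding cosine) a real system would use before any learned aggregation.

```python
# Minimal sketch of consistency-based confidence for black-box LLM outputs.
# Assumption: we already have N sampled generations for the same prompt.
from difflib import SequenceMatcher
from typing import List


def pairwise_similarity(a: str, b: str) -> float:
    """Sequence-level similarity in [0, 1]; a stand-in for any
    black-box similarity measure (ROUGE, embedding cosine, etc.)."""
    return SequenceMatcher(None, a, b).ratio()


def consistency_confidence(target: str, samples: List[str]) -> float:
    """Confidence proxy for `target`: its mean similarity to the other
    sampled generations. High agreement among samples -> high confidence."""
    others = [s for s in samples if s is not target]
    if not others:  # degenerate case: nothing to compare against
        return 1.0
    return sum(pairwise_similarity(target, s) for s in others) / len(others)


# Toy usage: three sampled answers to the same question.
samples = [
    "The capital of Australia is Canberra.",
    "Canberra is Australia's capital city.",
    "The capital of Australia is Sydney.",
]
scores = [consistency_confidence(s, samples) for s in samples]
```

In the paper's framework these raw similarity features are not used directly as confidences; instead they are fed to a small trained confidence-estimation model, which is what yields the improved calibration reported in the evaluation.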