🤖 AI Summary
Assessing response confidence for black-box large language models (LLMs) remains challenging due to the lack of access to internal parameters or gradients.
Method: This paper proposes a lightweight, interpretable confidence estimation framework that requires only model query outputs—no model internals. It leverages handcrafted, general-purpose response features (e.g., token entropy, self-consistency, length ratio) and employs logistic regression for modeling, ensuring plug-and-play deployment and strong interpretability.
Contribution/Results: The key innovation is discovering cross-model zero-shot generalization: a confidence model trained on a single LLM transfers effectively to diverse unseen LLMs—including Flan-UL2, Llama-13B, Mistral-7B, GPT-4, Pegasus-large, and BART-large—without fine-tuning. Evaluated across four question-answering and two summarization tasks, the method surpasses state-of-the-art baselines, with AUROC gains of over 10% in some cases, demonstrating substantial improvements in calibration and reliability.
📝 Abstract
Estimating uncertainty or confidence in the responses of a model can be significant in evaluating trust not only in the responses, but also in the model as a whole. In this paper, we explore the problem of estimating confidence for responses of large language models (LLMs) with only black-box or query access to them. We propose a simple and extensible framework where we engineer novel features and train an interpretable model (viz. logistic regression) on these features to estimate the confidence. We empirically demonstrate that our simple framework is effective in estimating confidence of Flan-UL2, Llama-13B, Mistral-7B, and GPT-4 on four benchmark Q&A tasks, as well as of Pegasus-large and BART-large on two benchmark summarization tasks, surpassing baselines by over 10% (on AUROC) in some cases. Additionally, our interpretable approach provides insight into the features that are predictive of confidence, leading to the interesting and useful discovery that confidence models built for one LLM generalize zero-shot to others on a given dataset.
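The described pipeline — handcrafted response features fed into a logistic regression that predicts whether a response is correct — can be sketched as follows. This is a minimal illustration on synthetic data, not the paper's implementation: the exact feature definitions, feature names, and data here are assumptions chosen for the sake of a runnable example.

```python
# Hedged sketch of a black-box confidence estimator: logistic regression over
# simple per-response features (illustrative stand-ins for features like mean
# token entropy, self-consistency, and a length ratio). Synthetic data is used
# in place of real LLM responses and correctness labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 1000
labels = rng.integers(0, 2, n)  # 1 = response judged correct (synthetic)

# Assumed correlations for illustration: correct responses tend to have
# lower token entropy and higher agreement among re-sampled answers.
entropy = rng.normal(2.0 - 0.8 * labels, 0.5, n)              # mean token entropy
consistency = np.clip(rng.normal(0.4 + 0.4 * labels, 0.15, n), 0.0, 1.0)
length_ratio = rng.normal(1.0, 0.3, n)                        # uninformative feature
X = np.column_stack([entropy, consistency, length_ratio])

# Fit on one LLM's responses; in the paper's setting, the same fitted model
# is then applied zero-shot to features computed from other LLMs' responses.
clf = LogisticRegression().fit(X[:800], labels[:800])
conf = clf.predict_proba(X[800:])[:, 1]  # confidence score per held-out response
print(f"held-out AUROC: {roc_auc_score(labels[800:], conf):.3f}")
```

Because the estimator only consumes features computed from query outputs, swapping in a different LLM requires no retraining — only recomputing the same features on its responses, which is what enables the zero-shot transfer the paper reports.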