🤖 AI Summary
Generative question-answering (QA) systems require interpretable, fine-grained confidence scores for high-stakes decision-making; however, existing calibration methods only ensure calibration on average over the whole data distribution and cannot support reliable assessment at the level of individual answers.
Method: We propose QA-calibration, a notion that formally defines subgroup-level calibration tailored to QA tasks, and design a distribution-free, discretization-based post-hoc calibration method: raw confidence scores are elicited from the LLM via prompting, then grouped, binned, and recalibrated against empirical accuracy.
Contribution/Results: Our approach provides theoretical guarantees without distributional assumptions. Evaluated on the Natural Questions and TriviaQA benchmarks with models including Llama-3 and GPT-4, it reduces Expected Calibration Error (ECE) by 35–62% while preserving answer accuracy, yielding substantially better answer-level confidence calibration with no accuracy trade-off.
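To make the reported metric concrete, here is a minimal sketch of the standard binned Expected Calibration Error (ECE): the weighted average gap between mean confidence and empirical accuracy per confidence bin. This is the generic metric, not the paper's own evaluation code; the function name and bin count are illustrative choices.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean |accuracy - confidence| gap per bin.

    confidences: raw model confidence scores in [0, 1]
    correct: 1 if the answer was correct, else 0
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        # Bins are (lo, hi]; the first bin also includes 0 exactly.
        mask = (confidences > lo) & (confidences <= hi)
        if i == 0:
            mask |= confidences == 0.0
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in bin
    return ece
```

A perfectly calibrated bin (confidence 0.8, empirical accuracy 0.8) contributes zero; overconfident bins drive the score up.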
📝 Abstract
To use generative question-answering (QA) systems for decision-making and in any critical application, these systems need to provide well-calibrated confidence scores that reflect the correctness of their answers. Existing calibration methods aim to ensure that the confidence score is, *on average*, indicative of the likelihood that the answer is correct. We argue, however, that this standard (average-case) notion of calibration is difficult to interpret for decision-making in generative QA. To address this, we generalize the standard notion of average calibration and introduce QA-calibration, which ensures calibration holds across different question-and-answer groups. We then propose discretized post-hoc calibration schemes for achieving QA-calibration. We establish distribution-free guarantees on the performance of this method and validate it on confidence scores returned by elicitation prompts across multiple QA benchmarks and large language models (LLMs).
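The "discretized post-hoc calibration" idea can be sketched as group-wise histogram binning: within each question-and-answer group, raw confidences falling in a bin are replaced by that bin's empirical accuracy on held-out calibration data. This is an assumed, simplified interface for illustration, not the paper's implementation or its guarantee-bearing construction.

```python
import numpy as np
from collections import defaultdict

class GroupedHistogramCalibrator:
    """Sketch of discretized post-hoc calibration per QA group
    (hypothetical class; names and interface are illustrative)."""

    def __init__(self, n_bins=10):
        self.n_bins = n_bins
        self.bin_acc = {}  # (group, bin index) -> empirical accuracy

    def _bin(self, conf):
        # Map a confidence in [0, 1] to one of n_bins equal-width bins.
        return min(int(conf * self.n_bins), self.n_bins - 1)

    def fit(self, groups, confidences, correct):
        stats = defaultdict(list)
        for g, c, y in zip(groups, confidences, correct):
            stats[(g, self._bin(c))].append(y)
        # Replace each (group, bin) with its empirical accuracy.
        self.bin_acc = {k: float(np.mean(v)) for k, v in stats.items()}
        return self

    def predict(self, groups, confidences):
        # Fall back to the raw score when a (group, bin) was unseen at fit time.
        return [self.bin_acc.get((g, self._bin(c)), c)
                for g, c in zip(groups, confidences)]
```

The key difference from ordinary histogram binning is that bins are maintained separately per group, so calibration holds within each question-and-answer group rather than only on average.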