🤖 AI Summary
Generative question-answering (QA) systems require interpretable, fine-grained confidence scores for high-stakes decision-making; however, existing calibration methods only ensure calibration on average over the whole data distribution and cannot support reliable assessment at the level of individual answers.
Method: We propose QA-calibration, a notion that formally defines subgroup-level calibration tailored to QA tasks, and design a distribution-free, discretization-based post-hoc calibration method: raw confidence scores are elicited from the LLM via prompting, then grouped, binned, and recalibrated against empirical accuracy.
Contribution/Results: Our approach provides theoretical guarantees without distributional assumptions. Evaluated on the Natural Questions and TriviaQA benchmarks with models including Llama-3 and GPT-4, it reduces Expected Calibration Error (ECE) by 35–62% while preserving answer accuracy, yielding substantially better answer-level confidence calibration with no accuracy trade-off.
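To make the reported metric concrete, here is a minimal sketch of the standard binned Expected Calibration Error (ECE): the weighted average gap between mean confidence and empirical accuracy per confidence bin. This is the generic metric, not the paper's own evaluation code; the function name and bin count are illustrative choices.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean |accuracy - confidence| gap per bin.

    confidences: raw model confidence scores in [0, 1]
    correct: 1 if the answer was correct, else 0
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        # Bins are (lo, hi]; the first bin also includes 0 exactly.
        mask = (confidences > lo) & (confidences <= hi)
        if i == 0:
            mask |= confidences == 0.0
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in bin
    return ece
```

A perfectly calibrated bin (confidence 0.8, empirical accuracy 0.8) contributes zero; overconfident bins drive the score up.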
📝 Abstract
To use generative question-answering (QA) systems for decision-making and in any critical application, these systems need to provide well-calibrated confidence scores that reflect the correctness of their answers. Existing calibration methods aim to ensure that the confidence score is, *on average*, indicative of the likelihood that the answer is correct. We argue, however, that this standard (average-case) notion of calibration is difficult to interpret for decision-making in generative QA. To address this, we generalize the standard notion of average calibration and introduce QA-calibration, which ensures calibration holds across different question-and-answer groups. We then propose discretized post-hoc calibration schemes for achieving QA-calibration. We establish distribution-free guarantees on the performance of this method and validate it on confidence scores returned by elicitation prompts across multiple QA benchmarks and large language models (LLMs).
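The "discretized post-hoc calibration" idea can be sketched as group-wise histogram binning: within each question-and-answer group, raw confidences falling in a bin are replaced by that bin's empirical accuracy on held-out calibration data. This is an assumed, simplified interface for illustration, not the paper's implementation or its guarantee-bearing construction.

```python
import numpy as np
from collections import defaultdict

class GroupedHistogramCalibrator:
    """Sketch of discretized post-hoc calibration per QA group
    (hypothetical class; names and interface are illustrative)."""

    def __init__(self, n_bins=10):
        self.n_bins = n_bins
        self.bin_acc = {}  # (group, bin index) -> empirical accuracy

    def _bin(self, conf):
        # Map a confidence in [0, 1] to one of n_bins equal-width bins.
        return min(int(conf * self.n_bins), self.n_bins - 1)

    def fit(self, groups, confidences, correct):
        stats = defaultdict(list)
        for g, c, y in zip(groups, confidences, correct):
            stats[(g, self._bin(c))].append(y)
        # Replace each (group, bin) with its empirical accuracy.
        self.bin_acc = {k: float(np.mean(v)) for k, v in stats.items()}
        return self

    def predict(self, groups, confidences):
        # Fall back to the raw score when a (group, bin) was unseen at fit time.
        return [self.bin_acc.get((g, self._bin(c)), c)
                for g, c in zip(groups, confidences)]
```

The key difference from ordinary histogram binning is that bins are maintained separately per group, so calibration holds within each question-and-answer group rather than only on average.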