🤖 AI Summary
Confidence scores from text generation models are often poorly calibrated because probability mass is dispersed across multiple valid outputs, so conventional calibration methods that depend on a single decoded sequence are inadequate for estimating true accuracy.
Method: We propose a fine-tuning-free, task-agnostic calibration evaluation framework that leverages intrinsic distributional properties of valid outputs in generative tasks—namely, entropy, maximum token probability, and support set size—to construct robust confidence metrics, thereby mitigating overreliance on any single decoded sequence.
Contribution/Results: The framework is compatible with autoregressive models including BART and Flan-T5. Empirically, it substantially improves calibration across summarization, machine translation, and question answering—reducing Expected Calibration Error (ECE) by up to 42%. Moreover, it enhances interpretability of low-confidence predictions and facilitates human-in-the-loop intervention.
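The Expected Calibration Error cited above can be estimated with the standard equal-width binning estimator: bucket predictions by confidence, then take the sample-weighted mean gap between each bucket's accuracy and its average confidence. A minimal sketch (the function name and default bin count are illustrative, not from the paper):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean |accuracy - avg confidence| over bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        # Half-open bins (lo, hi]; the first bin also includes 0.0.
        mask = (confidences > lo) & (confidences <= hi)
        if i == 0:
            mask |= confidences == 0.0
        if mask.any():
            acc = correct[mask].mean()      # empirical accuracy in the bin
            conf = confidences[mask].mean() # mean predicted confidence
            ece += mask.mean() * abs(acc - conf)
    return ece
```

For example, a batch where every prediction has confidence 0.8 and 4 of 5 are correct is perfectly calibrated and yields ECE of 0.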
📝 Abstract
Well-calibrated model confidence scores can improve the usefulness of text generation models. For example, users can be prompted to review predictions with low confidence scores, to prevent models from returning bad or potentially dangerous predictions. However, confidence metrics are not always well calibrated in text generation. One reason is that in generation, there can be many valid answers, which previous methods do not always account for. Hence, a confident model could distribute its output probability among multiple sequences because they are all valid. We propose task-agnostic confidence metrics suited to generation, which rely solely on the probabilities associated with the model outputs without the need for further fine-tuning or heuristics. Using these, we are able to improve the calibration of BART and Flan-T5 on summarization, translation, and QA datasets.
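As a sketch of what distribution-aware confidence metrics of this kind might look like, the snippet below computes entropy, maximum probability, and a support-set size over a set of candidate output probabilities (e.g. from beam search). The function name, the renormalization step, and the 1% support threshold are illustrative assumptions, not the paper's exact definitions:

```python
import math

def distribution_confidence(candidate_probs, eps=1e-12):
    """Confidence signals over candidate-output probabilities.

    candidate_probs: unnormalized probabilities of candidate sequences
    (assumption: obtained from a decoding procedure such as beam search).
    """
    z = sum(candidate_probs)
    p = [q / z for q in candidate_probs]  # renormalize over candidates
    entropy = -sum(q * math.log(q + eps) for q in p)
    return {
        "max_prob": max(p),       # mass on the single most likely candidate
        "entropy": entropy,       # how spread out the mass is
        # Number of candidates carrying non-negligible mass;
        # the 1% cutoff is an assumption for illustration.
        "support_size": sum(q > 0.01 for q in p),
    }
```

A model splitting its mass evenly between two paraphrases of the same valid answer would show low max probability but small support and low entropy, which is the kind of case a single-sequence confidence score misreads.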