Asking Is Not Enough: Protocol Sensitivity in LLM Confidence Calibration

📅 2026-05-26
🤖 AI Summary
This study examines how sensitive evaluations of large language model (LLM) confidence calibration are to the measurement protocol, specifically when token-level probabilities are compared with verbalized confidence. With the verbalized-confidence elicitation held fixed, the authors run controlled experiments on four question-answering benchmarks and vary three protocol choices (which answer string is scored, how the token probability is read from the answer tokens, and which conditioning context is used) across three open 7–8B base/Instruct model families, with larger Qwen2.5 variants as same-family robustness checks. Conditioning context changes the sign or magnitude of the ECE gap, token readout produces smaller but still sign-moving changes, and the choice of ECE estimator has little effect. Under the default generated-answer, bare-context protocol, Instruct settings are close to parity rather than showing a large calibration gain for verbalized confidence. In a separate supplied-answer analysis, surface-plausible wrong answers receive nearly the same confidence as supplied gold answers. The work argues that both confidence signals should be treated as protocol-dependent behavioral measurements and provides a reporting checklist covering elicitation provenance, scored answer, token-probability readout, and conditioning context.
📝 Abstract
LLM confidence calibration is often evaluated by comparing two signals: token-probability scores and verbalized confidence. These signals are sometimes treated as direct readouts of model uncertainty, but their comparison depends on measurement choices that are rarely made explicit. In the main analysis, we hold the verbalized-confidence elicitation fixed: a single prompt template, probability scale, and output format. We then vary the measurement axes that define the verbalized-vs-token comparison: which answer string receives the token-probability score, how that score is read from the answer tokens, and under which conditioning context it is measured. We evaluate this design on four QA benchmarks across three open 7–8B base/Instruct model families, with larger Qwen2.5 variants as same-family robustness checks. The resulting comparison is sensitive to these choices: conditioning context changes the sign or magnitude of the ECE gap across settings, token readout produces smaller but still sign-moving changes, and changing the ECE estimator has little effect. Under the default generated-answer, bare-context protocol, Instruct settings are close to parity rather than showing a large calibration gain for verbalized confidence. In a separate supplied-answer analysis, surface-plausible wrong answers receive nearly the same confidence as supplied gold answers, suggesting that verbalized confidence also reflects answer plausibility and provenance rather than correctness alone. We argue that both confidence signals should be treated as protocol-dependent behavioral measurements, and provide a reporting checklist covering elicitation provenance, scored answer, token-probability readout, and conditioning context.
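To make the "token readout" and "ECE gap" terms concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the three readouts (joint, length-normalized, first-token), the equal-width 10-bin ECE estimator, and the names `token_readouts`, `ece`, and `ece_gap` are all hypothetical choices, since the abstract does not specify which readouts or estimators were used.

```python
import math


def token_readouts(token_logprobs):
    """Turn per-token log-probabilities of one answer into candidate scores.

    The three readouts here are assumed examples of how a score might be
    "read from the answer tokens"; the paper's own set may differ.
    """
    total = sum(token_logprobs)
    return {
        # Probability of the whole answer string (product of token probs).
        "joint": math.exp(total),
        # Geometric mean of token probs, which removes the length penalty.
        "length_normalized": math.exp(total / len(token_logprobs)),
        # Probability of the first answer token only.
        "first_token": math.exp(token_logprobs[0]),
    }


def ece(confidences, correct, n_bins=10):
    """Expected calibration error with equal-width bins (an assumed estimator)."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, is_correct))
    error = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        error += (len(members) / n) * abs(avg_conf - accuracy)
    return error


def ece_gap(verbalized, token_scores, correct, n_bins=10):
    """Verbalized ECE minus token-probability ECE; negative favors verbalized."""
    return ece(verbalized, correct, n_bins) - ece(token_scores, correct, n_bins)
```

Because the same answers can yield different token scores under each readout, the sign of `ece_gap` can change with the readout alone, which is the kind of protocol dependence the abstract describes.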
Problem

Research questions and friction points this paper is trying to address.

confidence calibration
protocol sensitivity
large language models
token probability
verbalized confidence
Innovation

Methods, ideas, or system contributions that make the work stand out.

confidence calibration
protocol sensitivity
verbalized confidence
token-probability scores
measurement protocol