A Semantic-Sampling Framework for Evaluating Calibration in Open-Ended Question Answering

📅 2026-05-08
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing methods for evaluating the calibration of large language models fall short in open-ended question answering because they depend on constrained output formats, internal probability estimates, self-reported confidence, or task-specific heuristics. This work proposes Sem-ECE, a framework that estimates calibration error through semantic sampling: it samples answers from the model, clusters them by semantic equivalence, and uses the cluster frequencies as confidence. The authors study two estimators, Sem₁-ECE (same-sample self-consistency) and Sem₂-ECE (a held-out variant that separates answer selection from confidence evaluation), prove that both are asymptotically unbiased, and use the gap between them as a diagnostic for question difficulty. Experiments on three open-ended QA benchmarks with five commercial large language models show that Sem-ECE outperforms verbalized confidence and existing sampling-based methods, and that it complements logit-based evaluation when internal probabilities are unavailable.
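For orientation, the quantities named in the summary can be written out as follows. This is a sketch in assumed notation ($n$, $C_k$, $\hat{p}_k$, $I_b$ are not taken from the paper), using the standard binned form of expected calibration error; the paper's exact estimators may be defined differently.

```latex
% Assumed notation: n sampled answers for one question, grouped into
% semantic classes C_1, ..., C_K. Class frequency serves as confidence.
\hat{p}_k = \frac{|C_k|}{n}, \qquad
\hat{c} = \max_{k} \hat{p}_k

% Standard binned ECE over N questions, with I_b the set of questions
% whose confidence falls in bin b (B bins in total).
\mathrm{ECE} = \sum_{b=1}^{B} \frac{|I_b|}{N}
  \left| \mathrm{acc}(I_b) - \mathrm{conf}(I_b) \right|
```

Here $\mathrm{acc}(I_b)$ is the fraction of questions in bin $b$ whose selected answer is correct, and $\mathrm{conf}(I_b)$ is their mean confidence.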
📝 Abstract
Calibration measures whether a model's predicted confidence aligns with its empirical accuracy, and is central to the reliable deployment of large language models (LLMs) in high-stakes domains such as medicine and law. While much recent work focuses on improving LLM calibration, the equally important question of how to evaluate it in realistic settings remains underdeveloped. Open-ended question answering (QA), the most common deployment setting for modern LLMs, is where existing evaluation methods fall short: logit-based metrics need restricted output formats and internal probabilities; verbalized confidence is self-reported and often overconfident; and sampling-based methods rely on task-specific extraction rules without a clear finite-sample target. We introduce Sem-ECE (Semantic-Sampling Expected Calibration Error), a calibration evaluation framework for open-ended QA that samples answers from the model, groups them into semantic classes, and uses the resulting frequencies as confidence. We study two estimators within this framework: Sem$_1$-ECE, the same-sample self-consistency score, and Sem$_2$-ECE, a held-out variant that separates answer selection from confidence evaluation. We prove both are asymptotically unbiased, and further show that they agree on easy questions but diverge on hard ones with Sem$_2$ achieving strictly smaller calibration error, so their gap also serves as a diagnostic for question difficulty. Experiments on three open-ended QA benchmarks across five leading commercial LLMs match our theoretical predictions and show that Sem-ECE outperforms verbalized confidence and existing sampling-based methods, while complementing logit-based evaluation when internal probabilities are unavailable.
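The pipeline described in the abstract (sample, group into semantic classes, use frequencies as confidence) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration, not the authors' implementation: `normalize` is a hypothetical stand-in for semantic clustering (it only folds case, whitespace, and trailing punctuation), the function names are invented, and `ece` is the standard equal-width binned estimator rather than the paper's specific estimators.

```python
from collections import Counter


def normalize(answer: str) -> str:
    # Hypothetical stand-in for semantic clustering: answers that match
    # after case, whitespace, and edge-punctuation folding share a class.
    return " ".join(answer.lower().strip(" .!?").split())


def same_sample_confidence(samples):
    # Sem_1-style score: choose the majority class and measure its
    # frequency on the same samples used to choose it.
    counts = Counter(normalize(s) for s in samples)
    top, n_top = counts.most_common(1)[0]
    return top, n_top / len(samples)


def held_out_confidence(selection_samples, heldout_samples):
    # Sem_2-style score: choose the majority class on one set of samples,
    # then measure its frequency on a separate held-out set.
    top, _ = same_sample_confidence(selection_samples)
    hits = sum(normalize(s) == top for s in heldout_samples)
    return top, hits / len(heldout_samples)


def ece(confidences, correct, n_bins=10):
    # Standard binned ECE: weighted mean of |accuracy - mean confidence|
    # over equal-width confidence bins.
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # put c == 1.0 in the last bin
        bins[idx].append((c, y))
    total = 0.0
    for b in bins:
        if b:
            mean_conf = sum(c for c, _ in b) / len(b)
            acc = sum(y for _, y in b) / len(b)
            total += len(b) / n * abs(acc - mean_conf)
    return total
```

In this sketch the same-sample score reuses the samples that picked the answer, while the held-out score evaluates the chosen answer on fresh samples, which is the separation of answer selection from confidence evaluation that the abstract attributes to Sem$_2$-ECE. A real implementation would replace `normalize` with a semantic-equivalence judgment between answers.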
Problem

Research questions and friction points this paper is trying to address.

calibration
open-ended question answering
large language models
evaluation framework
semantic sampling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic-Sampling
Calibration Evaluation
Open-Ended QA
Expected Calibration Error
LLM Reliability