🤖 AI Summary
This work investigates the semantic calibration of large language models (LLMs) in open-domain question answering—that is, their ability to assign confidence scores that meaningfully reflect answer correctness. Addressing the lack of principled semantic confidence estimation in LLMs, the authors propose the "B-calibration" theoretical framework, which formally establishes semantic calibration as a natural emergent property of next-token prediction and derives sufficient conditions under which it holds. Methodologically, the work combines a sampling-based definition of semantic confidence, an analysis of local loss optimality, equivalence-class partitioning over answers, and experimental validation of a distributional prediction. Experiments demonstrate that base LLMs exhibit robust, task-agnostic semantic calibration; however, both RL-based instruction tuning and chain-of-thought reasoning significantly degrade this property. These findings provide a novel theoretical foundation and empirical evidence for trustworthy LLM evaluation.
📝 Abstract
Large Language Models (LLMs) often lack meaningful confidence estimates for their outputs. While base LLMs are known to exhibit next-token calibration, it remains unclear whether they can assess confidence in the actual meaning of their responses beyond the token level. We find that, when using a certain sampling-based notion of semantic calibration, base LLMs are remarkably well-calibrated: they can meaningfully assess confidence in open-domain question-answering tasks, despite not being explicitly trained to do so. Our main theoretical contribution establishes a mechanism for why semantic calibration emerges as a byproduct of next-token prediction, leveraging a recent connection between calibration and local loss optimality. The theory relies on a general definition of "B-calibration," which is a notion of calibration parameterized by a choice of equivalence classes (semantic or otherwise). This theoretical mechanism leads to a testable prediction: base LLMs will be semantically calibrated when they can easily predict their own distribution over semantic answer classes before generating a response. We state three implications of this prediction, which we validate through experiments: (1) Base LLMs are semantically calibrated across question-answering tasks, (2) RL instruction-tuning systematically breaks this calibration, and (3) chain-of-thought reasoning breaks calibration. To our knowledge, our work provides the first principled explanation of when and why semantic calibration emerges in LLMs.
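The sampling-based notion of semantic confidence described above can be sketched in a few lines: sample several answers to the same question, partition them into semantic equivalence classes, and take the empirical frequency of a class as the model's confidence in that answer. This is a minimal illustration under stated assumptions, not the paper's implementation; `sample_answer` and `are_equivalent` are hypothetical stand-ins for an LLM sampler and a semantic-equivalence judge.

```python
# Minimal sketch of sampling-based semantic confidence.
# `sample_answer` and `are_equivalent` are hypothetical stand-ins:
# in practice the former would sample responses from an LLM and the
# latter would judge whether two answers share the same meaning.

def semantic_confidence(sample_answer, are_equivalent, n_samples=20):
    """Sample answers, greedily partition them into semantic
    equivalence classes, and return the majority class's
    representative together with its empirical frequency."""
    samples = [sample_answer() for _ in range(n_samples)]
    classes = []  # each entry: [representative, list_of_members]
    for s in samples:
        for cls in classes:
            if are_equivalent(s, cls[0]):  # same meaning as representative
                cls[1].append(s)
                break
        else:
            classes.append([s, [s]])  # start a new equivalence class
    rep, members = max(classes, key=lambda c: len(c[1]))
    return rep, len(members) / n_samples

# Deterministic toy demo: a fixed pool stands in for LLM samples, and
# case-insensitive string match stands in for semantic equivalence.
pool = iter(["Paris", "paris", "Paris", "Lyon", "PARIS"] * 4)  # 20 draws
answer, conf = semantic_confidence(
    lambda: next(pool),
    lambda a, b: a.lower() == b.lower(),
    n_samples=20,
)
# 16 of the 20 draws fall in the "Paris" class, so conf is 0.8
```

In a real setup, the equivalence judge might itself be an LLM or an entailment model, and the returned frequency is the confidence score whose agreement with answer correctness the paper's calibration experiments measure.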