🤖 AI Summary
This work investigates the semantic calibration of large language models (LLMs) in open-domain question answering—that is, their ability to assign confidence scores that meaningfully reflect answer correctness. Addressing the lack of principled semantic confidence estimation in LLMs, the authors propose the "B-calibration" theoretical framework, which formally establishes semantic calibration as a natural emergent property of next-token prediction and derives sufficient conditions under which it holds. Methodologically, the work combines a sampling-based definition of semantic confidence, an analysis of local loss optimality, equivalence-class partitioning over answers, and experimental validation of a distributional prediction. Experiments demonstrate that base LLMs exhibit robust, task-agnostic semantic calibration; however, both RL-based instruction tuning and chain-of-thought reasoning significantly degrade this property. These findings provide a novel theoretical foundation and empirical evidence for trustworthy LLM evaluation.
📝 Abstract
Large Language Models (LLMs) often lack meaningful confidence estimates for their outputs. While base LLMs are known to exhibit next-token calibration, it remains unclear whether they can assess confidence in the actual meaning of their responses beyond the token level. We find that, when using a certain sampling-based notion of semantic calibration, base LLMs are remarkably well-calibrated: they can meaningfully assess confidence in open-domain question-answering tasks, despite not being explicitly trained to do so. Our main theoretical contribution establishes a mechanism for why semantic calibration emerges as a byproduct of next-token prediction, leveraging a recent connection between calibration and local loss optimality. The theory relies on a general definition of "B-calibration," which is a notion of calibration parameterized by a choice of equivalence classes (semantic or otherwise). This theoretical mechanism leads to a testable prediction: base LLMs will be semantically calibrated when they can easily predict their own distribution over semantic answer classes before generating a response. We state three implications of this prediction, which we validate through experiments: (1) Base LLMs are semantically calibrated across question-answering tasks, (2) RL instruction-tuning systematically breaks this calibration, and (3) chain-of-thought reasoning breaks calibration. To our knowledge, our work provides the first principled explanation of when and why semantic calibration emerges in LLMs.
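The sampling-based notion of semantic confidence described above can be sketched in a few lines: sample several answers to the same question, partition them into semantic equivalence classes, and take the empirical frequency of a class as the model's confidence in that answer. This is a minimal illustration under stated assumptions, not the paper's implementation; `sample_answer` and `are_equivalent` are hypothetical stand-ins for an LLM sampler and a semantic-equivalence judge.

```python
# Minimal sketch of sampling-based semantic confidence.
# `sample_answer` and `are_equivalent` are hypothetical stand-ins:
# in practice the former would sample responses from an LLM and the
# latter would judge whether two answers share the same meaning.

def semantic_confidence(sample_answer, are_equivalent, n_samples=20):
    """Sample answers, greedily partition them into semantic
    equivalence classes, and return the majority class's
    representative together with its empirical frequency."""
    samples = [sample_answer() for _ in range(n_samples)]
    classes = []  # each entry: [representative, list_of_members]
    for s in samples:
        for cls in classes:
            if are_equivalent(s, cls[0]):  # same meaning as representative
                cls[1].append(s)
                break
        else:
            classes.append([s, [s]])  # start a new equivalence class
    rep, members = max(classes, key=lambda c: len(c[1]))
    return rep, len(members) / n_samples

# Deterministic toy demo: a fixed pool stands in for LLM samples, and
# case-insensitive string match stands in for semantic equivalence.
pool = iter(["Paris", "paris", "Paris", "Lyon", "PARIS"] * 4)  # 20 draws
answer, conf = semantic_confidence(
    lambda: next(pool),
    lambda a, b: a.lower() == b.lower(),
    n_samples=20,
)
# 16 of the 20 draws fall in the "Paris" class, so conf is 0.8
```

In a real setup, the equivalence judge might itself be an LLM or an entailment model, and the returned frequency is the confidence score whose agreement with answer correctness the paper's calibration experiments measure.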