Language Model Probabilities are Not Calibrated in Numeric Contexts

📅 2024-10-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work reveals severe miscalibration in large language models' (LLMs) probabilistic outputs for numerical reasoning: models fail to distribute confidence across answer options according to the probability distribution implied by the context (e.g., uniform coin flips or dice-combination frequencies), challenging the common assumption that output probabilities directly reflect reliability. Method: the authors introduce the first evaluation framework tailored to probabilistic calibration in numeric contexts, combining controllable prompting templates, combinatorial enumeration of outcomes, empirical frequency estimation, and bias-attribution analysis, and systematically evaluate models including GPT-4o-mini and Llama-3.1-8B. Contribution/Results: they identify stable, cross-model systematic biases, such as positional preference and lexical-frequency interference, with expected calibration error (ECE) significantly exceeding human baselines; crucially, calibration does not improve with model scale. This is the first empirical demonstration of pervasive failure in LLMs' numeric probability calibration, providing foundational evidence for trustworthy reasoning and uncertainty-aware modeling.
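The summary reports expected calibration error (ECE) well above human baselines. As a reference point, here is a minimal sketch of the standard binned ECE computation (the common equal-width-bin definition, not necessarily the paper's exact evaluation protocol):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: partition predictions into equal-width
    confidence bins and average |accuracy - mean confidence| over
    bins, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for p, c in zip(confidences, correct):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into last bin
        bins[idx].append((p, c))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue  # empty bins contribute nothing
        avg_conf = sum(p for p, _ in b) / len(b)
        accuracy = sum(c for _, c in b) / len(b)
        ece += (len(b) / n) * abs(accuracy - avg_conf)
    return ece
```

Each prediction contributes a confidence in [0, 1] and a 0/1 correctness flag; a perfectly calibrated model (e.g., 50% confidence with 50% accuracy) scores an ECE of 0.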

📝 Abstract
Some statements have one well-defined continuation (e.g., "the Eiffel Tower is in [Paris]"), whereas others have a natural distribution over multiple options (e.g., "the weighted coin flip was [Heads/Tails]"). We argue that language model (LM) outputs should capture these natural distributions. Our work specifically tests whether LM output probabilities are calibrated to numeric information within their textual contexts. For example, if the context (the prompt) concerns two equally likely options (e.g., heads or tails for a fair coin), the LM output probabilities should also be equal. Likewise, in a context with nonuniformly likely events (e.g., rolling a pair with two dice), an LM should output proportionate probabilities. However, we find that even in simple settings, the best LMs (1) are poorly calibrated and (2) have systematic biases: artifacts like word identity, word order, and word frequency all impact calibration. For example, gpt-4o-mini often picks the first of two options presented in the prompt regardless of the options' implied likelihoods, whereas Llama-3.1-8B picks the second. Models do not allocate probability mass among valid options in a calibrated manner.
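The abstract's dice example fixes a ground-truth distribution that a calibrated LM's output probabilities should match. A small sketch (the function name is illustrative, not from the paper) that enumerates the 36 equally likely ordered rolls of two fair dice to recover the implied distribution:

```python
from itertools import product

def implied_two_dice_distribution():
    """Enumerate all 36 ordered rolls of two fair dice and return the
    implied probability of each sum -- the kind of context-implied
    distribution the paper compares LM output probabilities against."""
    counts = {}
    for a, b in product(range(1, 7), repeat=2):
        s = a + b
        counts[s] = counts.get(s, 0) + 1
    total = sum(counts.values())  # 36 equally likely outcomes
    return {s: c / total for s, c in counts.items()}
```

For instance, the implied probability of a sum of 7 is 6/36, and the probability of rolling a pair is likewise 6/36; a calibrated LM asked about such events should assign probabilities in these proportions.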
Problem

Research questions and friction points this paper is trying to address.

Language models are poorly calibrated in numeric contexts.
Systematic biases (word identity, order, and frequency) distort LM probability outputs.
LMs fail to allocate probability mass in proportion to contextually implied likelihoods.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tests whether LM output probabilities are calibrated to numeric information in the prompt.
Identifies stable, cross-model systematic biases (e.g., positional preference).
Analyzes how artifacts such as word identity affect calibration.