Certain but not Probable? Differentiating Certainty from Probability in LLM Token Outputs for Probabilistic Scenarios

📅 2025-11-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study identifies a systematic misalignment between token-level output determinism in large language models (LLMs) and theoretically grounded probability distributions in probabilistic reasoning tasks: even when semantic responses are fully correct (100% accuracy), the logits over generated tokens deviate significantly from the corresponding Bayesian posterior distributions. Using GPT-4.1 and DeepSeek-Chat, the authors analyze token-level logits across multiple prompt iterations, quantify uncertainty via entropy, and compare empirical token distributions against normative probabilistic constraints. The results demonstrate that prevailing uncertainty quantification (UQ) methods fail to ensure token-level calibration, exposing a fundamental tension between semantic correctness and probabilistic coherence. To the authors' knowledge, this is the first empirical characterization of the "determinism–probability mismatch" phenomenon, in which deterministic token selection contradicts principled uncertainty representation. The work establishes a new benchmark for trustworthy probabilistic reasoning and motivates theoretical reexamination of UQ in autoregressive LMs.

📝 Abstract
Reliable uncertainty quantification (UQ) is essential for ensuring trustworthy downstream use of large language models, especially when they are deployed in decision-support and other knowledge-intensive applications. Model certainty can be estimated from token logits, with derived probability and entropy values offering insight into performance on the prompt task. However, this approach may be inadequate for probabilistic scenarios, where the probabilities of token outputs are expected to align with the theoretical probabilities of the possible outcomes. We investigate the relationship between token certainty and alignment with theoretical probability distributions in well-defined probabilistic scenarios. Using GPT-4.1 and DeepSeek-Chat, we evaluate model responses to ten prompts involving probability (e.g., roll a six-sided die), both with and without explicit probability cues in the prompt (e.g., roll a fair six-sided die). We measure two dimensions: (1) response validity with respect to scenario constraints, and (2) alignment between token-level output probabilities and theoretical probabilities. Our results indicate that, while both models achieve perfect in-domain response accuracy across all prompt scenarios, their token-level probability and entropy values consistently diverge from the corresponding theoretical distributions.
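The comparison the abstract describes can be illustrated with a minimal sketch: take the logits a model assigns to the candidate outcome tokens, convert them to probabilities with a softmax, and compare the resulting distribution against the theoretical one (uniform for a fair six-sided die) via entropy and KL divergence. The logit values below are purely illustrative, not taken from the paper.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(q * math.log(q) for q in p if q > 0)

def kl_divergence(p, q):
    """KL(p || q); measures divergence of observed from theoretical."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits over the six outcome tokens "1".."6" for a
# prompt like "roll a fair six-sided die": a near-deterministic
# model concentrates almost all mass on a single token.
observed = softmax([1.0, 1.2, 9.5, 0.8, 1.1, 0.9])

# Theoretical distribution for a fair die: uniform over six faces.
theoretical = [1 / 6] * 6

print(f"observed entropy:    {entropy(observed):.3f} nats")
print(f"theoretical entropy: {entropy(theoretical):.3f} nats")
print(f"KL(observed || theoretical): {kl_divergence(observed, theoretical):.3f}")
```

A well-calibrated response to this scenario would have entropy near ln 6 ≈ 1.792 nats and near-zero KL divergence; the paper's finding is that, despite perfect response validity, observed token distributions look more like the peaked one sketched here.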
Problem

Research questions and friction points this paper is trying to address.

Differentiating certainty from probability in LLM token outputs
Evaluating alignment with theoretical probability distributions in probabilistic scenarios
Assessing token-level probability divergence from expected theoretical distributions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Differentiates token certainty from probability alignment
Evaluates models using probabilistic scenarios with cues
Measures token-level probability divergence from theoretical distributions