Measuring Language Model Hallucinations Through Distributional Correctness

📅 2025-10-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Conventional language model evaluation relies on single-answer accuracy or binary scoring, ignoring the model's full probability distribution over answer options and thus failing to distinguish harmful hallucinations (confident mass on wrong answers) from benign abstention (a deliberate "I don't know"). Method: The paper proposes the Distributional Correctness Score (DCS), an evaluation metric that scores a model's entire probability distribution over answer choices, distinguishing harmful overconfidence in wrong answers from uncertainty expressed through abstention and yielding scores in an interpretable default range. Contribution/Results: Twelve existing benchmarks are adapted to DCS's variants and six language models are evaluated; on half of the benchmarks, all tested models score negatively, indicating significant tendencies towards hallucination. Theoretical analysis and illustrative examples suggest DCS offers a more nuanced evaluation paradigm that incentivises models to express genuine uncertainty rather than guess.

📝 Abstract
Common evaluation paradigms for language models focus on scoring single responses through accuracy metrics or proper scoring rules, failing to capture the full richness of a model's belief state. Recent work illustrates that language models hallucinate in part because they are optimised to be good test-takers under binary scoring schemes that reward any answer over abstention. While this insight naturally leads to penalty-based approaches, these still ignore crucial distinctions in how models distribute uncertainty, for example between hedging toward incorrect answers and hedging toward "I don't know" responses. To address this gap, a novel evaluation metric, the Distributional Correctness Score (DCS), is introduced that considers a model's entire probability distribution over answer choices. DCS naturally distinguishes between harmful overconfidence in wrong answers and uncertainty expressed through abstention, providing scores in an interpretable default range. Through theoretical analysis and illustrative examples, DCS is shown to offer a more nuanced and aligned evaluation paradigm that incentivises models to express genuine uncertainty rather than guess. Adapting 12 existing evaluation benchmarks to DCS's variants and measuring performance on six language models reveals that for half of the tested benchmarks, scores are negative across all tested models, indicating significant tendencies towards hallucination.
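The abstract does not give the exact DCS formula, so as an illustration only, here is a minimal sketch of one plausible scoring rule consistent with the description: probability mass on the correct answer counts positively, mass on wrong answers counts negatively, and mass on an explicit abstention option is neutral, giving a default per-question range of [-1, 1]. The function name `dcs` and the `"IDK"` abstention label are assumptions, not the paper's notation.

```python
# Hypothetical sketch only: the paper's exact DCS formula is not given in
# this summary. This illustrates one plausible scoring rule consistent with
# the description: mass on the correct answer counts positively, mass on
# wrong answers counts negatively, and abstention mass is neutral, so a
# single-question score falls in the default range [-1, 1].

def dcs(probs, correct, abstain="IDK"):
    """Score an answer distribution (illustrative, not the paper's formula).

    probs   -- dict mapping answer options to model probabilities (sums to 1)
    correct -- the ground-truth answer option
    abstain -- label of the abstention option, treated as neutral
    """
    score = 0.0
    for option, p in probs.items():
        if option == abstain:
            continue  # abstention is neither rewarded nor penalised
        score += p if option == correct else -p
    return score

# Confident mass on a wrong answer drives the score negative,
# while hedging toward "I don't know" merely pulls it toward zero.
overconfident = dcs({"A": 0.1, "B": 0.8, "IDK": 0.1}, correct="A")  # negative
abstaining    = dcs({"A": 0.1, "B": 0.1, "IDK": 0.8}, correct="A")  # near zero
```

Under a rule like this, a model that piles probability on a wrong answer is penalised more than one that routes the same uncertainty into abstention, which is exactly the distinction binary accuracy cannot make.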
Problem

Research questions and friction points this paper is trying to address.

Evaluating language model hallucinations via distributional correctness metrics
Distinguishing harmful overconfidence from uncertainty in model responses
Addressing limitations of binary scoring in capturing model belief states
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduced Distributional Correctness Score metric
Evaluates model probability distribution over answers
Distinguishes harmful overconfidence from uncertainty
🔎 Similar Papers
No similar papers found.