Hallucination as Commitment Failure: Larger LLMs Misfire Despite Knowing the Answer

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates why large language models (LLMs) hallucinate even when the correct answer is already semantically available to them. It introduces a semantic-level measure of answer availability and combines semantic clustering, probability distribution analysis, and hidden state inspection. Systematic experiments across the Qwen and Llama model families (0.8B–72B) show that in 16%–47% of hallucinated responses from Instruct models, the correct concept already carries substantial probability mass at the moment the model commits to an answer. In these cases the hallucination stems not from absent knowledge but from a failure, during answer commitment, to concentrate that semantic probability mass onto a single surface form, and the share of such cases grows with model scale. The work offers a mechanistic account of hallucination in terms of semantic–surface form alignment and suggests a new perspective for mitigating such errors.

📝 Abstract

Hallucination is often viewed as a direct consequence of missing knowledge: a model answers incorrectly when the correct answer is absent from its generation-time distribution, and correctly when it is present. We test this assumption by introducing a semantic notion of answer availability that aggregates token-level variants expressing the same answer concept, and asks whether the correct concept is already available at the moment the model commits to an answer. Across Qwen and Llama models from 0.8B to 72B in both Instruct and Base variants, 16-47% of Instruct hallucinations occur with substantial probability mass already on the correct concept, and the rate rises monotonically with scale. Comparing such failures against correct generations with matched semantic support, the distinguishing factor is not whether the correct concept is represented, but how its probability is distributed: correct generations concentrate mass on a single surface form, hallucinations disperse it across alternatives. The same sharpening asymmetry extends across multi-token generation and is detectable in pre-generation hidden states. Together, these results identify a single mechanism: instruction tuning sharpens answer commitment with scale, making helpfulness and confident hallucination two consequences of the same underlying disposition.
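The distinction the abstract draws between semantic availability and surface-form concentration can be illustrated with a small sketch. Everything below is an assumption made for illustration: the probabilities are invented toy numbers, the `concept_of` mapping stands in for whatever semantic clustering the authors use, and the function names (`semantic_availability`, `concentration`, `greedy_pick`) are hypothetical rather than taken from the paper.

```python
import math
from collections import defaultdict


def semantic_availability(surface_probs, concept_of):
    """Sum the probability of every surface form that expresses the same concept."""
    mass = defaultdict(float)
    for form, p in surface_probs.items():
        # Forms missing from the mapping are treated as their own concept.
        mass[concept_of.get(form, form)] += p
    return dict(mass)


def concentration(surface_probs, concept_of, concept):
    """Share of a concept's total mass held by its single most probable surface form."""
    probs = [p for form, p in surface_probs.items()
             if concept_of.get(form, form) == concept]
    total = sum(probs)
    return max(probs) / total if total > 0 else 0.0


def greedy_pick(surface_probs):
    """Surface form with the highest individual probability."""
    return max(surface_probs, key=surface_probs.get)


# Hypothetical clustering of surface forms into answer concepts.
concept_of = {
    "Paris": "paris",
    "paris": "paris",
    "Paris, France": "paris",
    "Lyon": "lyon",
}

# Toy distributions with the same semantic support for the correct concept
# (about 0.60 on "paris") but different spreads across surface forms.
concentrated = {"Paris": 0.54, "paris": 0.04, "Paris, France": 0.02, "Lyon": 0.40}
dispersed = {"Paris": 0.22, "paris": 0.20, "Paris, France": 0.18, "Lyon": 0.40}

for name, dist in [("concentrated", concentrated), ("dispersed", dispersed)]:
    support = semantic_availability(dist, concept_of)["paris"]
    share = concentration(dist, concept_of, "paris")
    print(f"{name}: support={support:.2f} top-form share={share:.2f} "
          f"pick={greedy_pick(dist)}")
```

In this toy setting both distributions place the same total mass on the correct concept, yet the dispersed one picks the wrong surface form because no single correct variant outweighs the competing answer. This mirrors the qualitative pattern described above; it is not a reproduction of the paper's measurements.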

Problem

Research questions and friction points this paper is trying to address.

hallucination

large language models

answer commitment

semantic representation

instruction tuning

Innovation

Methods, ideas, or system contributions that make the work stand out.

hallucination

commitment failure

semantic answer availability