🤖 AI Summary
Large language models are prone to hallucination when processing linearized structured knowledge, such as graphs and tables, yet the underlying mechanisms remain poorly understood. This study systematically investigates internal model dynamics and finds that hallucinations arise primarily because attention favors structural shortcut cues, while feed-forward layers fail to anchor the provided external knowledge and fall back on parametric memory. Using attention visualization, feed-forward representation analysis, and cross-task generalization experiments across structured formats (single-hop graphs, multi-hop graphs, and tables), the work shows that hallucination is consistently associated with semantic grounding failures in feed-forward layers and that attention patterns vary by task. It then uses these findings to detect hallucinations across formats.
📝 Abstract
In many reasoning tasks, large language models (LLMs) rely on structured external knowledge, such as graphs and tables, which is typically linearized into sequential token representations. However, even when sufficient knowledge is available, LLMs can still produce hallucinated outputs, and the underlying mechanisms behind such failures remain poorly understood. We investigate these mechanisms and find that hallucinations arise from systematic internal dynamics rather than random noise. First, attention concentrates disproportionately on shortcut-like structural cues rather than distributing across the full context. Second, feed-forward representations fail to ground the provided knowledge, causing the model to revert to parametric memory. Moreover, our results indicate that hallucination is consistently associated with failures in semantic grounding within feed-forward layers, while attention allocation exhibits greater task-dependent variability. Finally, we show that these mechanistic patterns generalize beyond single-hop graphs to multi-hop and tabular settings, enabling effective hallucination detection across structured knowledge formats.
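To make two of the ideas above concrete, the following is a minimal illustrative sketch, not the paper's method. It shows one possible way to linearize graph triples and table rows into a flat string, and one possible way to quantify how concentrated an attention distribution is. The separator format, the function names (`linearize_triples`, `linearize_table`, `normalized_entropy`), and the choice of normalized entropy as a concentration measure are all assumptions made for illustration; the abstract does not specify the linearization scheme or the metrics actually used.

```python
import math


def linearize_triples(triples):
    """Flatten (head, relation, tail) triples into one string.

    The " | " and " ; " separators are an assumed format.
    """
    return " ; ".join(f"{h} | {r} | {t}" for h, r, t in triples)


def linearize_table(header, rows):
    """Flatten a table row by row, pairing each cell with its column name."""
    return " ; ".join(
        " , ".join(f"{col}: {cell}" for col, cell in zip(header, row))
        for row in rows
    )


def normalized_entropy(weights):
    """Entropy of a non-negative weight vector, scaled to [0, 1].

    Values near 0 mean the mass sits on a few positions (concentrated);
    a value of 1 means the mass is spread uniformly over all positions.
    """
    if len(weights) < 2:
        return 0.0
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return abs(entropy / math.log(len(weights)))


if __name__ == "__main__":
    graph = [("Paris", "capital_of", "France"), ("France", "part_of", "Europe")]
    print(linearize_triples(graph))
    print(linearize_table(["city", "country"], [["Paris", "France"]]))
    # Hypothetical attention weights over four context tokens.
    print(normalized_entropy([0.85, 0.05, 0.05, 0.05]))
```

Under these assumptions, a low normalized entropy for an attention head over the linearized context would be one sign that attention has collapsed onto a few tokens, such as separators or other structural cues, rather than spreading over the supplied facts.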