🤖 AI Summary
This work investigates the dynamic relationship between hallucination in large language models (LLMs) and hidden-state drift induced by incremental context injection. We conduct 16 rounds of progressive context titration experiments, integrating the TruthfulQA benchmark with a tri-perspective hallucination detector to quantitatively analyze attention dynamics and representation evolution. Our key contributions include: (i) the first identification of an “attention-locking” threshold (JS-Drift ≈ 0.69; Spearman-Drift ≈ 0), establishing a dynamic criterion for hallucination solidification; (ii) a mechanistic distinction between hallucinations triggered by relevant versus irrelevant context; and (iii) empirical validation across six open-source LLMs, revealing that hallucination rates monotonically increase with context rounds before saturating at 5–7 rounds, and exhibit a model-size-dependent trade-off—larger models demonstrate stronger contextual assimilation but reduced attention diffusion.
📝 Abstract
Hallucinations -- plausible yet erroneous outputs -- remain a critical barrier to reliable deployment of large language models (LLMs). We present the first systematic study linking hallucination incidence to internal-state drift induced by incremental context injection. Using TruthfulQA, we construct two 16-round "titration" tracks per question: one appends relevant but partially flawed snippets, the other injects deliberately misleading content. Across six open-source LLMs, we track overt hallucination rates with a tri-perspective detector and covert dynamics via cosine, entropy, JS and Spearman drifts of hidden states and attention maps. Results reveal (1) monotonic growth of hallucination frequency and representation drift that plateaus after 5--7 rounds; (2) relevant context drives deeper semantic assimilation, producing high-confidence "self-consistent" hallucinations, whereas irrelevant context induces topic-drift errors anchored by attention re-routing; and (3) convergence of JS-Drift (≈ 0.69) and Spearman-Drift (≈ 0) marks an "attention-locking" threshold beyond which hallucinations solidify and become resistant to correction. Correlation analyses expose a seesaw between assimilation capacity and attention diffusion, clarifying size-dependent error modes. These findings supply empirical foundations for intrinsic hallucination prediction and context-aware mitigation mechanisms.
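The drift metrics named above (cosine drift of hidden states, JS and Spearman drifts of attention maps) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names, the round-0-vs-round-k comparison, and the flatten-and-normalize treatment of attention maps are all assumptions.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import spearmanr

def attention_drift(attn_a, attn_b):
    """Compare two attention maps, flattened and normalized to distributions.

    Returns (js_drift, spearman_drift). Exact normalization and layer/head
    aggregation in the paper may differ; this is an illustrative sketch.
    """
    p = np.ravel(attn_a) / np.sum(attn_a)
    q = np.ravel(attn_b) / np.sum(attn_b)
    # scipy returns the JS *distance*; square it to get the divergence (base 2)
    js = jensenshannon(p, q, base=2) ** 2
    # rank agreement of attention weights; rho near 0 = attention re-routing
    rho, _ = spearmanr(p, q)
    return js, rho

def cosine_drift(h_a, h_b):
    """Hidden-state drift as 1 - cosine similarity between state vectors."""
    cos = np.dot(h_a, h_b) / (np.linalg.norm(h_a) * np.linalg.norm(h_b))
    return 1.0 - cos
```

Under this reading, the "attention-locking" threshold corresponds to JS-Drift converging near 0.69 while Spearman-Drift collapses toward 0, i.e. the attention distribution has both diverged from and lost rank agreement with its initial state.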