From Noise to Narrative: Tracing the Origins of Hallucinations in Transformers

📅 2025-09-08
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Transformer hallucinations (ungrounded, factually incorrect generations) pose critical trust and safety risks in high-stakes applications, particularly under input uncertainty, where models activate semantically coherent yet input-irrelevant concepts. Method: We propose a sparse autoencoder-based conceptual representation analysis framework to dissect intermediate-layer activations; by integrating controlled noise perturbations and targeted steering, we identify intermediate-layer conceptual patterns predictive of hallucination. Contribution/Results: We demonstrate empirically, for the first time, that intermediate-layer concept activations robustly predict final-output hallucinations, enabling automatic, quantitative hallucination risk assessment. Our analysis uncovers systematic semantic concept expansion under increasingly unstructured inputs, revealing interpretable, measurable failure modes. This work establishes a novel, explainable, and quantifiable paradigm for AI safety, alignment, and trustworthy generation.
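The summary gives no implementation details, so the snippet below is only a minimal, self-contained sketch of the kind of sparse-autoencoder decomposition of intermediate-layer activations it describes. The module name SparseAutoencoder, the sizes d_model and n_concepts, the L1 sparsity penalty, and the random tensors standing in for real transformer activations are all illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Maps d_model-dimensional activations onto an overcomplete, non-negative concept basis."""
    def __init__(self, d_model: int, n_concepts: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_concepts)
        self.decoder = nn.Linear(n_concepts, d_model)

    def forward(self, acts: torch.Tensor):
        concepts = torch.relu(self.encoder(acts))    # sparse, non-negative concept activations
        recon = self.decoder(concepts)               # reconstruction of the original activation
        return concepts, recon

d_model, n_concepts = 768, 8192                      # assumed sizes
sae = SparseAutoencoder(d_model, n_concepts)

# Stand-in for residual-stream activations from one intermediate transformer layer.
layer_acts = torch.randn(32, d_model)
concepts, recon = sae(layer_acts)

# The usual SAE training objective: reconstruction error plus an L1 sparsity penalty.
loss = torch.mean((recon - layer_acts) ** 2) + 1e-3 * concepts.abs().mean()

# "Active concepts" per input: features firing above a small threshold.
print((concepts > 1e-3).sum(dim=-1).float().mean().item())
```

With activations from a real model, the mean number of active concepts per input is the quantity the summary says expands as inputs become unstructured.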

📝 Abstract
As generative AI systems become competent and democratized in science, business, and government, deeper insight into their failure modes is now an acute need. The occasional volatility in their behavior, such as the propensity of transformer models to hallucinate, impedes trust and adoption of emerging AI solutions in high-stakes areas. In the present work, we establish how and when hallucinations arise in pre-trained transformer models through the concept representations captured by sparse autoencoders, under scenarios with experimentally controlled uncertainty in the input space. Our systematic experiments reveal that the number of semantic concepts used by the transformer model grows as the input information becomes increasingly unstructured. In the face of growing uncertainty in the input space, the transformer model becomes prone to activating coherent yet input-insensitive semantic features, leading to hallucinated output. At the extreme, for pure-noise inputs, we identify a wide variety of robustly triggered and meaningful concepts in the intermediate activations of pre-trained transformer models, whose functional integrity we confirm through targeted steering. We also show that hallucinations in a transformer model's output can be reliably predicted from the concept patterns embedded in its layer activations. This collection of insights into transformer internal processing has immediate consequences for aligning AI models with human values and for AI safety; it also exposes a potential adversarial attack surface and provides a basis for automatically quantifying a model's hallucination risk.
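As a rough illustration of the controlled-uncertainty protocol described above, the sketch below mixes a batch of stand-in activations with increasing amounts of Gaussian noise and counts how many concepts fire. The variance-preserving noise schedule, the firing threshold, and the random matrix W_enc standing in for a trained sparse-autoencoder encoder are all assumptions; with such a toy setup the count will not reproduce the reported growth, it only shows how the measurement could be wired up.

```python
import torch

torch.manual_seed(0)
d_model, n_concepts = 768, 8192

# Stand-in for the encoder of an already-trained sparse autoencoder (weights would be learned).
W_enc = torch.randn(d_model, n_concepts) / d_model ** 0.5

def mean_active_concepts(acts: torch.Tensor, threshold: float = 1.0) -> float:
    """Average number of concepts firing above `threshold` across a batch of activations."""
    concepts = torch.relu(acts @ W_enc)
    return (concepts > threshold).float().sum(dim=-1).mean().item()

# Stand-in for intermediate-layer activations produced by structured (real-text) inputs.
structured = torch.randn(64, d_model)

for t in [0.0, 0.25, 0.5, 0.75, 1.0]:
    # Variance-preserving interpolation toward pure noise (an assumed perturbation schedule).
    noisy = (1 - t) ** 0.5 * structured + t ** 0.5 * torch.randn_like(structured)
    print(f"noise level {t:.2f}: {mean_active_concepts(noisy):.1f} active concepts")
```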
Problem

Research questions and friction points this paper is trying to address.

Investigating origins of hallucinations in transformer models
Analyzing how input uncertainty triggers model hallucinations
Predicting hallucinations from concept patterns in activations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse autoencoders analyze transformer concept representations
Identify hallucination triggers under controlled input uncertainty
Predict hallucinations from concept patterns in activations (see the sketch below)
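As a concrete reading of the last point, a hedged sketch: fit a simple classifier on per-prompt concept-activation patterns to predict whether the final output is hallucinated. The synthetic features and labels, the logistic-regression choice, and the AUROC metric are illustrative assumptions; the paper's exact predictor and evaluation setup are not specified here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_prompts, n_concepts = 2000, 512

# Stand-in for per-prompt concept-activation patterns pooled over an intermediate layer.
X = rng.random((n_prompts, n_concepts))
# Stand-in for binary labels: did the final output contain a hallucination?
y = rng.integers(0, 2, size=n_prompts)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# With real activations and labels, the AUROC would quantify hallucination-risk prediction.
print("AUROC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```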