Where Fake Citations Are Made: Tracing Field-Level Hallucination to Specific Neurons in LLMs

📅 2026-04-20

🤖 AI Summary

This study examines why large language models (LLMs) generate fabricated citations, a failure that is most pronounced in the author field. Analyzing 108,000 generated references across nine models, the work shows that citation hallucinations follow field-specific patterns and identifies a sparse set of neurons in Qwen2.5-32B-Instruct, termed FH-neurons, that are associated with hallucination in particular citation fields. The neurons are selected by applying elastic-net regularization with stability selection to neuron-level CETT values, and their role is checked with causal intervention experiments: amplifying FH-neurons increases hallucination, whereas suppressing them improves citation accuracy across fields, with larger gains in some fields. These findings suggest a lightweight pathway for detecting and mitigating citation hallucination in LLMs using internal model signals.
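The amplify/suppress intervention described above can be illustrated with a minimal sketch. It is not the paper's implementation: it uses a toy one-hidden-layer network in place of Qwen2.5-32B-Instruct, and the names `mlp_forward`, `W_in`, `W_out`, and `scale` are hypothetical. The only idea it shows is that selected hidden units have their activations multiplied by a factor (0 to suppress, greater than 1 to amplify) before the output projection.

```python
def mlp_forward(x, W_in, W_out, scale=None):
    """Toy one-hidden-layer ReLU network (an assumption, not the paper's model).

    `scale` maps a hidden-unit index to a multiplier applied to that unit's
    activation: 0.0 suppresses the unit, values above 1.0 amplify it.
    """
    scale = scale or {}
    hidden = []
    for j, row in enumerate(W_in):
        pre = sum(w * xi for w, xi in zip(row, x))
        act = max(0.0, pre)                      # ReLU activation
        hidden.append(act * scale.get(j, 1.0))   # intervention on unit j
    return [sum(w * h for w, h in zip(row, hidden)) for row in W_out]


# Hypothetical weights: 2 inputs, 3 hidden units, 1 output.
W_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_out = [[0.5, -1.0, 2.0]]
x = [1.0, 2.0]

baseline = mlp_forward(x, W_in, W_out)                     # no intervention
suppressed = mlp_forward(x, W_in, W_out, scale={2: 0.0})   # unit 2 zeroed
amplified = mlp_forward(x, W_in, W_out, scale={2: 2.0})    # unit 2 doubled
```

In a real transformer the same scaling would typically be applied to chosen MLP neurons during the forward pass, for example through a framework's hook mechanism; which layers, neurons, and scaling factors the authors used is not stated in this summary.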

📝 Abstract

LLMs frequently generate fictitious yet convincing citations, often expressing high confidence even when the underlying reference is wrong. We study this failure across 9 models and 108,000 generated references, and find that author names fail far more often than other fields across all models and settings. Citation style has no measurable effect, while reasoning-oriented distillation degrades recall. Probes trained on one field transfer at near-chance levels to the others, suggesting that hallucination signals do not generalize across fields. Building on this finding, we apply elastic-net regularization with stability selection to neuron-level CETT values of Qwen2.5-32B-Instruct and identify a sparse set of field-specific hallucination neurons (FH-neurons). Causal intervention further confirms their role: amplifying these neurons increases hallucination, while suppressing them improves performance across fields, with larger gains in some fields. These results suggest a lightweight approach to detecting and mitigating citation hallucination using internal model signals alone.
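As a rough illustration of elastic-net regularization with stability selection, the sketch below fits an elastic-net linear model on repeated random subsamples and keeps only the features that receive a nonzero weight in most fits. Everything in it is an assumption made for illustration: the data are synthetic stand-ins for neuron-level CETT values, the model is a plain linear regression solved by coordinate descent, and the subsample fraction, penalty strength, and 0.8 selection threshold are arbitrary choices rather than the paper's settings.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator used by the L1 part of the penalty."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def elastic_net(X, y, alpha=0.2, l1_ratio=0.9, n_iter=50):
    """Coordinate descent for
    (1/2n)||y - Xw||^2 + alpha*l1_ratio*||w||_1 + 0.5*alpha*(1-l1_ratio)*||w||^2.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    resid = list(y)  # residual y - Xw, starting from w = 0
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * w[j]
            new_w = soft_threshold(rho, alpha * l1_ratio) / (
                col_sq[j] + alpha * (1.0 - l1_ratio)
            )
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                w[j] = new_w
    return w


def stability_selection(X, y, n_rounds=30, frac=0.5, threshold=0.8, seed=0):
    """Return features selected in at least `threshold` of the subsample fits."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    m = max(1, int(n * frac))
    for _ in range(n_rounds):
        idx = rng.sample(range(n), m)  # subsample without replacement
        w = elastic_net([X[i] for i in idx], [y[i] for i in idx])
        for j in range(p):
            if w[j] != 0.0:
                counts[j] += 1
    freq = [c / n_rounds for c in counts]
    return [j for j in range(p) if freq[j] >= threshold], freq


# Synthetic data (hypothetical): only features 0 and 3 drive the target.
data_rng = random.Random(42)
X = [[data_rng.gauss(0.0, 1.0) for _ in range(10)] for _ in range(120)]
y = [2.0 * row[0] - 1.5 * row[3] + data_rng.gauss(0.0, 0.1) for row in X]

selected, freq = stability_selection(X, y)
```

The point of combining the two steps is that a single sparse fit can pick up features by chance, whereas features that survive across many subsamples are more likely to carry a stable signal. For a binary hallucinated/correct label, a penalized logistic model would be the more natural choice than the linear regression used here.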

Problem

Research questions and friction points this paper is trying to address.

citation hallucination

large language models

fictitious citations

field-specific hallucination

author name errors

Innovation

Methods, ideas, or system contributions that make the work stand out.

hallucination neurons

field-specific hallucination

causal intervention