🤖 AI Summary
This study addresses the tendency of large language models (LLMs) to generate fabricated citations, with hallucinations in the author field being the most pronounced. Analyzing 108,000 generated references across nine models, the work reveals for the first time that citation hallucinations follow field-specific patterns and identifies a sparse set of neurons, termed FH-neurons, in Qwen2.5-32B-Instruct that are associated with hallucinations in particular citation fields. The neurons are identified by applying elastic-net regularization with stability selection to neuron-level CETT values. Causal intervention experiments then show that amplifying FH-neurons increases hallucination, whereas suppressing them improves citation accuracy across fields, with larger gains in some fields. These findings suggest a lightweight pathway for detecting and mitigating citation hallucinations in LLMs using internal model signals alone.
📝 Abstract
LLMs frequently generate fictitious yet convincing citations, often expressing high confidence even when the underlying reference is wrong. We study this failure across 9 models and 108,000 generated references, and find that author names fail far more often than other fields across all models and settings. Citation style has no measurable effect, while reasoning-oriented distillation degrades recall. Probes trained on one field transfer at near-chance levels to the others, suggesting that hallucination signals do not generalize across fields. Building on this finding, we apply elastic-net regularization with stability selection to neuron-level CETT values of Qwen2.5-32B-Instruct and identify a sparse set of field-specific hallucination neurons (FH-neurons). Causal intervention further confirms their role: amplifying these neurons increases hallucination, while suppressing them improves performance across fields, with larger gains in some fields. These results suggest a lightweight approach to detecting and mitigating citation hallucination using internal model signals alone.
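The selection step named in the abstract, elastic-net regularization combined with stability selection, can be illustrated with a minimal, standard-library-only sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy data (six random columns standing in for per-neuron scores, with only columns 0 and 1 driving the label), the proximal-gradient solver, the function names, and the hyperparameters (`l1`, `l2`, `n_rounds`, `threshold`). The idea is to refit a sparse classifier on many random half-samples and keep only the features whose coefficients stay nonzero in most fits.

```python
import math
import random


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink toward zero, clip at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_elastic_net_logistic(X, y, l1=0.1, l2=0.01, lr=0.5, n_iter=100):
    """Logistic regression with an elastic-net penalty, fitted by proximal gradient descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_iter):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(d):
                grad[j] += err * xi[j] / n
        # Gradient step on the smooth part (log-loss + L2), then L1 shrinkage.
        w = [soft_threshold(w[j] - lr * (grad[j] + l2 * w[j]), lr * l1) for j in range(d)]
    return w


def stability_selection(X, y, n_rounds=20, threshold=0.8, seed=0, **fit_kwargs):
    """Refit on random half-samples; keep features that are nonzero in most fits."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    counts = [0] * d
    for _ in range(n_rounds):
        idx = rng.sample(range(n), n // 2)
        w = fit_elastic_net_logistic([X[i] for i in idx], [y[i] for i in idx], **fit_kwargs)
        for j in range(d):
            if w[j] != 0.0:
                counts[j] += 1
    freq = [c / n_rounds for c in counts]
    selected = [j for j in range(d) if freq[j] >= threshold]
    return selected, freq


# Toy data (hypothetical): 6 stand-in "neuron" scores per sample;
# only columns 0 and 1 determine the label.
rng = random.Random(0)
X = [[rng.gauss(0.0, 1.0) for _ in range(6)] for _ in range(200)]
y = [1 if row[0] + row[1] > 0 else 0 for row in X]
selected, freq = stability_selection(X, y)
print("selected features:", selected)
print("selection frequencies:", freq)
```

The L1 term is what produces exact zeros, so a coefficient's selection frequency across subsamples acts as a rough measure of how reliably that feature carries signal; the L2 term keeps the fit stable when features are correlated. In practice a library solver would replace the hand-written loop, which is kept here only so that the sketch stays self-contained.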