🤖 AI Summary
This study addresses hallucinated citations generated by large language models (LLMs) in academic writing, a problem that undermines their reliability in high-stakes contexts such as evidence synthesis in software engineering. The authors construct a closed-book test set of 144 claims and build an automated verification pipeline over Crossref and Semantic Scholar, supplemented by manual auditing. They systematically quantify, for the first time, how five prompting regimes, including temporal windows and non-disclosure policies, affect citation accuracy across four leading LLMs. All models exhibit citation-level existence rates below 0.475, with the temporal and combined constraints causing the steepest declines. Between 36% and 61% of generated citations are unverifiable, and spot checks confirm that many are entirely fabricated. The findings underscore the need for post-hoc validation before LLM-generated literature reviews are put to use.
📝 Abstract
LLMs are increasingly used to draft academic text and to support software engineering (SE) evidence synthesis, but they often hallucinate bibliographic references that look legitimate. We study how deployment-motivated prompting constraints affect citation verifiability in a closed-book setting. Using 144 claims (24 in SE&CS) and a deterministic verification pipeline (Crossref + Semantic Scholar), we evaluate two proprietary models (Claude Sonnet, GPT-4o) and two open-weight models (LLaMA 3.1-8B, Qwen 2.5-14B) across five regimes: Baseline, Temporal (publication-year window), Survey-style breadth, Non-Disclosure policy, and their combination. Across 17,443 generated citations, no model exceeds a citation-level existence rate of 0.475; Temporal and Combo conditions produce the steepest drops while outputs remain format-compliant (well-formed bibliographic fields). Unresolved outcomes dominate (36-61%); a 100-citation audit indicates that a substantial fraction of Unresolved cases are fabricated. Results motivate post-hoc citation verification before LLM outputs enter SE literature reviews or tooling pipelines.
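The verification step the abstract describes (checking whether a generated citation resolves to a real record in Crossref) can be illustrated with a minimal offline sketch. The paper does not disclose its exact matching logic, so the token-overlap heuristic, the 0.85 threshold, and the function names below are illustrative assumptions; only the Crossref `/works` endpoint and its `query.bibliographic` parameter are real API details.

```python
import re
from urllib.parse import urlencode

def normalize_title(title: str) -> set:
    """Lowercase, strip punctuation, and tokenize a title into a word set."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))

def jaccard(a: set, b: set) -> float:
    """Jaccard overlap between two token sets (0.0 when either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def crossref_query_url(title: str, rows: int = 5) -> str:
    """Build a Crossref /works bibliographic-search URL for a candidate citation."""
    return "https://api.crossref.org/works?" + urlencode(
        {"query.bibliographic": title, "rows": rows}
    )

def classify(candidate_title: str, returned_titles: list, threshold: float = 0.85) -> str:
    """Mark a citation Verified if any API hit is close enough, else Unresolved.

    `returned_titles` would come from the JSON response of the URL above;
    the 0.85 similarity threshold is an assumed value, not the paper's.
    """
    cand = normalize_title(candidate_title)
    best = max((jaccard(cand, normalize_title(t)) for t in returned_titles), default=0.0)
    return "Verified" if best >= threshold else "Unresolved"
```

In this framing, "Unresolved" citations (the 36-61% band reported above) are those for which no sufficiently similar record comes back from either index, which is exactly why the paper's manual audit is needed to separate fabrications from indexing gaps.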