🤖 AI Summary
This work addresses citation hallucinations in scientific text generated by large language models, which appear either as corrupted metadata or as entirely fabricated references. The authors propose CiteCheck, a framework that combines retrieval from scholarly databases, a structured large language model verifier, and calibrated multi-tier decision rules to detect citation inaccuracies at a fine grain, from minor metadata deviations to complete fabrications. On the held-out test set of a 982-citation physics benchmark, CiteCheck achieves a macro-F1 of 88.7 and an accuracy of 88.9%, outperforming GPT, Claude, and Gemini baselines, including their web-search and few-shot variants.
📝 Abstract
Large language models (LLMs) are increasingly used to generate scientific reports, but they can produce references that appear plausible while containing corrupted metadata or pointing to papers that do not exist. We introduce CiteCheck, a hybrid framework for citation hallucination detection that verifies whether a citation corresponds to a real scholarly work and whether its metadata is faithful to that work. CiteCheck retrieves candidate publications from external scholarly sources, compares the citation against the retrieved candidate using a structured LLM verifier, and maps verifier scores into three labels: Exact, Minor, and Major. We also construct a 982-citation physics benchmark with controlled corruptions that capture both subtle metadata drift and fully fabricated references. On the held-out test set, CiteCheck achieves 88.7 macro-F1 and 88.9% accuracy, outperforming GPT, Claude, and Gemini baselines, including web-search and few-shot variants. These results show that reliable citation verification benefits from combining scholarly retrieval, structured LLM-based comparison, and calibrated decision rules.
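The final step of the pipeline, mapping verifier scores into the three labels, can be illustrated with a minimal Python sketch. The abstract does not give the actual decision rules, so everything specific here is an assumption made for illustration: a single scalar score in [0, 1], the threshold values 0.9 and 0.5, the names `Thresholds` and `label_citation`, and the choice to treat a citation with no retrieved candidate as Major.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical cut-offs; the paper's calibrated values are not stated in the abstract.
    exact: float = 0.9
    minor: float = 0.5


def label_citation(score: Optional[float], t: Thresholds = Thresholds()) -> str:
    """Map a verifier score to one of the labels Exact, Minor, or Major.

    `score` is assumed to be a similarity-style value in [0, 1] produced by
    comparing the citation with the retrieved candidate. `None` stands for
    "retrieval returned no candidate", which this sketch treats as Major.
    """
    if score is None:
        return "Major"
    if score >= t.exact:
        return "Exact"
    if score >= t.minor:
        return "Minor"
    return "Major"


if __name__ == "__main__":
    for s in (0.97, 0.70, 0.20, None):
        print(s, label_citation(s))
```

In a calibrated setup, the two cut-offs would be tuned on a validation split rather than fixed by hand, which is presumably what "calibrated decision rules" refers to.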