Errors in AI-Assisted Retrieval of Medical Literature: A Comparative Study

📅 2026-03-21
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study addresses the lack of systematic evaluation of citation accuracy among large language models (LLMs) in the medical domain. It presents the first multidimensional quantitative analysis of five widely used, freely accessible LLMs in retrieving references from high-impact medical journals. Performance was assessed using a composite score based on the validity and relevance of DOIs, PubMed IDs (PMIDs), and Google Scholar links, supplemented by complete omission rates and multivariable regression to isolate the independent effects of platform and journal type. Results reveal an average complete failure rate of 47.8%, with overall low accuracy and significant inter-platform variation: Grok achieved the highest accuracy score (0.57) and Gemini the lowest (0.11). Notably, retrieval performance was poorest for articles from The New England Journal of Medicine.

๐Ÿ“ Abstract
Large language model (LLM)-assisted literature retrieval may produce erroneous references, but these errors have not been rigorously quantified. Therefore, we quantitatively assessed errors in reference retrieval by widely used free-version LLM platforms and identified factors associated with retrieval errors. We evaluated 2,000 references retrieved by 5 LLMs (Grok-2, ChatGPT GPT-4.1, Google Gemini Flash 2.5, Perplexity AI, and DeepSeek GPT-4) for 40 randomly selected original articles (10 per journal) published Jan. 2024 to July 2025 in the British Medical Journal (BMJ), the Journal of the American Medical Association (JAMA), and The New England Journal of Medicine (NEJM). Primary outcomes were a multimetric score ratio combining the validity of the digital object identifier (DOI), PubMed ID, and Google Scholar link, plus relevance; and the complete miss rate (proportion of references failing all applicable metrics). Multivariable regression was used to examine independent associations. LLM platforms completely failed to retrieve correct reference data 47.8% of the time. The average score ratio across the 5 LLM platforms was 0.29 (standard deviation, 0.35; range, 0-1.25), with a higher score ratio indicating higher accuracy in retrieving relevant references and correct bibliographic data. The highest and lowest accuracies were achieved by Grok (0.57) and Gemini (0.11), respectively. Compared with BMJ, NEJM articles had lower score ratios and higher complete miss rates. Multivariable analysis showed that LLM platforms and journals were independently associated with score ratios and complete miss rates, respectively. We show modest overall performance of LLMs and significant variability in retrieval accuracy across platforms and journals. LLM platform and journal are associated with LLM performance in retrieving medical literature. Bibliographic data should be carefully reviewed when using LLM-assisted literature retrieval.
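The two primary outcomes described above can be illustrated with a minimal sketch. Note the assumptions: the abstract does not give the paper's exact weighting scheme (the reported range of 0-1.25 suggests some metrics carry extra weight), so this sketch simply averages over whichever metrics are applicable to a reference; the `Reference` class and field names are hypothetical, with `None` marking a metric that does not apply.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Reference:
    """One retrieved reference; None means the metric is not applicable."""
    doi_valid: Optional[bool]
    pmid_valid: Optional[bool]
    scholar_link_valid: Optional[bool]
    relevant: Optional[bool]

    def _applicable(self):
        checks = (self.doi_valid, self.pmid_valid,
                  self.scholar_link_valid, self.relevant)
        return [c for c in checks if c is not None]

def score_ratio(ref: Reference) -> Optional[float]:
    """Fraction of applicable metrics the reference passes (equal weights)."""
    checks = ref._applicable()
    return sum(checks) / len(checks) if checks else None

def complete_miss_rate(refs: list[Reference]) -> float:
    """Proportion of references failing every applicable metric."""
    scored = [r for r in refs if r._applicable()]
    if not scored:
        return 0.0
    misses = sum(1 for r in scored if not any(r._applicable()))
    return misses / len(scored)
```

For example, a reference with a valid DOI and Google Scholar link but an invalid PMID and judged irrelevant would score 0.5 under equal weighting, while a reference failing all four metrics counts toward the complete miss rate.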
Problem

Research questions and friction points this paper is trying to address.

AI-assisted retrieval
medical literature
reference errors
large language models
bibliographic accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

large language models
medical literature retrieval
reference accuracy
multimetric evaluation
bibliographic error
🔎 Similar Papers
No similar papers found.
Jenny Gao
College of Arts and Science, New York University, New York, NY 10003
Yongfeng Zhang
Professor of Computer Science, Rutgers University
Machine Learning · Information Retrieval · Recommender System · NLP · MLSys
Mary L Disis
UW Medicine Cancer Vaccine Institute University of Washington, Seattle, WA 98109, United States
Lanjing Zhang
Princeton Medical Center and Rutgers University
Pathology · Machine learning/AI · Epidemiology · Cancer · Liver