Errors in AI-Assisted Retrieval of Medical Literature: A Comparative Study

📅 2026-03-21
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study addresses the lack of systematic evaluation of citation accuracy among large language models (LLMs) in the medical domain. It presents the first multidimensional quantitative analysis of five widely used, freely accessible LLMs in retrieving references from high-impact medical journals. Performance was assessed using a composite score based on the validity and relevance of DOIs, PubMed IDs (PMIDs), and Google Scholar links, supplemented by complete omission rates and multivariable regression to isolate the independent effects of platform and journal type. Results reveal an average complete failure rate of 47.8%, with overall low accuracy and significant inter-platform variation: Grok achieved the highest accuracy score (0.57) and Gemini the lowest (0.11). Notably, retrieval performance was poorest for articles from The New England Journal of Medicine.

๐Ÿ“ Abstract
Large language model (LLM)-assisted literature retrieval may produce erroneous references, but these errors have not been rigorously quantified. Therefore, we quantitatively assessed errors in reference retrieval by widely used free-version LLM platforms and identified factors associated with retrieval errors. We evaluated 2,000 references retrieved by 5 LLMs (Grok-2, ChatGPT GPT-4.1, Google Gemini Flash 2.5, Perplexity AI, and DeepSeek GPT-4) for 40 randomly selected original articles (10 per journal) published Jan. 2024 to July 2025 in the British Medical Journal (BMJ), the Journal of the American Medical Association (JAMA), and The New England Journal of Medicine (NEJM). Primary outcomes were a multimetric score ratio combining the validity of the digital object identifier (DOI), PubMed ID, and Google Scholar link, plus relevance; and the complete miss rate (proportion of references failing all applicable metrics). Multivariable regression was used to examine independent associations. LLM platforms completely failed to retrieve correct reference data 47.8% of the time. The average score ratio across the 5 LLM platforms was 0.29 (standard deviation, 0.35; range, 0-1.25), with a higher score ratio indicating higher accuracy in retrieving relevant references and correct bibliographic data. The highest and lowest accuracies were achieved by Grok (0.57) and Gemini (0.11), respectively. Compared with BMJ, NEJM articles had lower score ratios and higher complete miss rates. Multivariable analysis showed that LLM platforms and journals were independently associated with score ratios and complete miss rates, respectively. We show modest overall performance of LLMs and significant variability in retrieval accuracy across platforms and journals. LLM platform and journal are associated with LLM performance in retrieving medical literature. Bibliographic data should be carefully reviewed when using LLM-assisted literature retrieval.
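The two primary outcomes described above can be illustrated with a minimal sketch. Note the assumptions: the abstract does not give the paper's exact weighting scheme (the reported range of 0-1.25 suggests some metrics carry extra weight), so this sketch simply averages over whichever metrics are applicable to a reference; the `Reference` class and field names are hypothetical, with `None` marking a metric that does not apply.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Reference:
    """One retrieved reference; None means the metric is not applicable."""
    doi_valid: Optional[bool]
    pmid_valid: Optional[bool]
    scholar_link_valid: Optional[bool]
    relevant: Optional[bool]

    def _applicable(self):
        checks = (self.doi_valid, self.pmid_valid,
                  self.scholar_link_valid, self.relevant)
        return [c for c in checks if c is not None]

def score_ratio(ref: Reference) -> Optional[float]:
    """Fraction of applicable metrics the reference passes (equal weights)."""
    checks = ref._applicable()
    return sum(checks) / len(checks) if checks else None

def complete_miss_rate(refs: list[Reference]) -> float:
    """Proportion of references failing every applicable metric."""
    scored = [r for r in refs if r._applicable()]
    if not scored:
        return 0.0
    misses = sum(1 for r in scored if not any(r._applicable()))
    return misses / len(scored)
```

For example, a reference with a valid DOI and Google Scholar link but an invalid PMID and judged irrelevant would score 0.5 under equal weighting, while a reference failing all four metrics counts toward the complete miss rate.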
Problem

Research questions and friction points this paper is trying to address.

AI-assisted retrieval
medical literature
reference errors
large language models
bibliographic accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

large language models
medical literature retrieval
reference accuracy
multimetric evaluation
bibliographic error
🔎 Similar Papers
No similar papers found.
Jenny Gao
College of Arts and Science, New York University, New York, NY 10003
Yongfeng Zhang
Professor of Computer Science, Rutgers University
Machine Learning · Information Retrieval · Recommender System · NLP · MLSys
Mary L Disis
UW Medicine Cancer Vaccine Institute University of Washington, Seattle, WA 98109, United States
Lanjing Zhang
Princeton Medical Center and Rutgers University
Pathology · Machine learning/AI · Epidemiology · Cancer · Liver