🤖 AI Summary
Large language models (LLMs) suffer from factual inaccuracies in radiological diagnosis due to outdated knowledge. To address this, we propose an end-to-end, real-time, radiology-specific RAG framework that dynamically crawls authoritative sources (e.g., Radiopaedia), integrates semantic retrieval with zero-shot prompt engineering, and supports plug-and-play integration of multiple LLMs (GPT, Mistral, Llama3). Our key contributions are: (1) overcoming the limitations of static knowledge bases by enabling continuous, real-time clinical knowledge updating; and (2) the first systematic characterization of RAG's performance gains across radiological subspecialties, revealing significant heterogeneity between models, particularly in breast and emergency radiology. Evaluated on the RSNA case dataset and an expert-annotated question bank, our framework achieves up to 54% relative improvement in diagnostic accuracy over non-RAG baselines, matching or exceeding both non-augmented LLMs and a human radiologist, most notably in breast and emergency imaging tasks.
📝 Abstract
Large language models (LLMs) often generate outdated or inaccurate information because they rely on static training datasets. Retrieval augmented generation (RAG) mitigates this by integrating outside data sources. While previous RAG systems used pre-assembled, fixed databases with limited flexibility, we have developed Radiology RAG (RadioRAG), an end-to-end framework that retrieves data from authoritative radiologic online sources in real time. We evaluate the diagnostic accuracy of various LLMs when answering radiology-specific questions with and without access to additional online information via RAG. Using 80 questions from the RSNA Case Collection across radiologic subspecialties and 24 additional expert-curated questions with reference standard answers, LLMs (GPT-3.5-turbo, GPT-4, Mistral-7B, Mixtral-8x7B, and Llama3 [8B and 70B]) were prompted with and without RadioRAG in a zero-shot inference scenario. RadioRAG retrieved context-specific information from www.radiopaedia.org in real time. Diagnostic accuracy was the primary metric, statistical analyses were performed using bootstrapping, and results were further compared with human performance. RadioRAG improved diagnostic accuracy across most LLMs, with relative accuracy increases of up to 54%. It matched or exceeded non-RAG models and the human radiologist in question answering across radiologic subspecialties, particularly in breast imaging and emergency radiology. However, the degree of improvement varied among models: GPT-3.5-turbo and Mixtral-8x7B-instruct-v0.1 saw notable gains, while Mistral-7B-instruct-v0.2 showed no improvement, highlighting variability in RadioRAG's effectiveness. LLMs benefit when given access to domain-specific data beyond their training data. For radiology, RadioRAG establishes a robust framework that substantially improves diagnostic accuracy and factuality in radiological question answering.
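The retrieve-then-prompt loop the abstract describes (crawl a source, rank snippets against the question, ground a zero-shot prompt in the top hits) can be sketched as follows. This is a minimal illustration, not the RadioRAG implementation: the function names and sample corpus are hypothetical, the bag-of-words similarity stands in for a real semantic embedding model, and the hard-coded snippets stand in for content crawled live from www.radiopaedia.org.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would use a
    # learned semantic embedding model instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(question: str, corpus: list[str], k: int = 2) -> list[str]:
    # Rank crawled snippets by similarity to the question, keep top-k.
    q = embed(question)
    ranked = sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]

def build_prompt(question: str, context: list[str]) -> str:
    # Zero-shot prompt that grounds the LLM in the retrieved context.
    ctx = "\n".join(f"- {c}" for c in context)
    return (f"Answer the radiology question using only the context below.\n"
            f"Context:\n{ctx}\nQuestion: {question}\nAnswer:")

# Stand-in for snippets retrieved in real time from an online source.
corpus = [
    "Pneumothorax appears as absent lung markings with a visible pleural line.",
    "BI-RADS 5 on breast imaging indicates findings highly suggestive of malignancy.",
    "Subdural hematoma is a crescent-shaped extra-axial collection.",
]

question = "What does a pleural line with absent lung markings suggest?"
prompt = build_prompt(question, retrieve(question, corpus))
print(prompt)
```

The retrieved context would then be sent, together with the question, to any of the interchangeable LLM backends; swapping the model only changes the final generation call, which is what makes the design plug-and-play.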