AI Summary
Traditional single-step retrieval-augmented generation (RAG) struggles to support complex clinical reasoning in radiology question answering. To address this, we propose an agent-based multi-step RAG framework that enables large language models (LLMs) to autonomously decompose queries, iteratively retrieve evidence from clinical knowledge sources (e.g., Radiopaedia), and dynamically integrate that evidence during answer generation. This work is the first to deeply integrate agent architectures with iterative RAG, uncovering a complementary synergy between retrieval augmentation and supervised fine-tuning. Experiments demonstrate substantial improvements: mean diagnostic accuracy increases from 64% to 73% (+9 percentage points), with smaller models (e.g., Qwen 2.5-7B) gaining up to 16 percentage points; hallucinations are reduced (mean rate 9.4%); and clinically relevant context is successfully retrieved in 46% of cases, significantly enhancing factual consistency and reasoning robustness.
Abstract
Clinical decision-making in radiology increasingly benefits from artificial intelligence (AI), particularly through large language models (LLMs). However, traditional retrieval-augmented generation (RAG) systems for radiology question answering (QA) typically rely on single-step retrieval, limiting their ability to handle complex clinical reasoning tasks. Here we propose an agentic RAG framework enabling LLMs to autonomously decompose radiology questions, iteratively retrieve targeted clinical evidence from Radiopaedia, and dynamically synthesize evidence-based responses. We evaluated 24 LLMs spanning diverse architectures, parameter scales (0.5B to >670B), and training paradigms (general-purpose, reasoning-optimized, clinically fine-tuned), using 104 expert-curated radiology questions from previously established RSNA-RadioQA and ExtendedQA datasets. Agentic retrieval significantly improved mean diagnostic accuracy over zero-shot prompting (73% vs. 64%; P<0.001) and conventional online RAG (73% vs. 68%; P<0.001). The greatest gains occurred in mid-sized models (e.g., Mistral Large improved from 72% to 81%) and small-scale models (e.g., Qwen 2.5-7B improved from 55% to 71%), while very large models (>200B parameters) demonstrated minimal changes (<2% improvement). Additionally, agentic retrieval reduced hallucinations (mean 9.4%) and retrieved clinically relevant context in 46% of cases, substantially aiding factual grounding. Even clinically fine-tuned models exhibited meaningful improvements (e.g., MedGemma-27B improved from 71% to 81%), indicating complementary roles of retrieval and fine-tuning. These results highlight the potential of agentic frameworks to enhance factuality and diagnostic accuracy in radiology QA, particularly among mid-sized LLMs, warranting future studies to validate their clinical utility.
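The agentic loop described above (decompose the question, retrieve evidence per sub-query, then synthesize) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names, the keyword-matching retriever, and the tiny in-memory knowledge base are all assumptions standing in for an LLM planner and a Radiopaedia-backed search index.

```python
def decompose(question):
    """Split a complex question into focused sub-queries (stubbed).

    A real agent would prompt the LLM to plan these sub-queries.
    """
    return [part.strip() for part in question.split(" and ")]


# Stand-in for a Radiopaedia-style search index (illustrative only).
KNOWLEDGE = {
    "pneumothorax signs": "Deep sulcus sign may be seen on supine radiograph.",
    "management": "Small asymptomatic pneumothorax: observation is an option.",
}


def retrieve(query):
    """Return the best-matching snippet, or None (stubbed keyword lookup)."""
    for key, snippet in KNOWLEDGE.items():
        if key in query.lower():
            return snippet
    return None


def agentic_rag(question, max_steps=4):
    """Iteratively gather evidence for each sub-query, then hand off for synthesis."""
    evidence = []
    for sub_query in decompose(question)[:max_steps]:
        snippet = retrieve(sub_query)
        if snippet is not None:  # keep only grounded evidence
            evidence.append(snippet)
    # A real system would pass `evidence` back to the LLM to generate the answer.
    return {"question": question, "evidence": evidence}


result = agentic_rag("pneumothorax signs and management")
```

The key design point the sketch reflects is that retrieval happens per sub-query inside the loop, rather than once up front as in conventional single-step RAG, so later steps can target evidence the initial query would have missed.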