AI Summary
Traditional single-step retrieval-augmented generation (RAG) struggles to support complex clinical reasoning in radiology question answering. To address this, we propose an agent-based multi-step RAG framework that enables large language models (LLMs) to autonomously decompose queries, iteratively retrieve evidence from clinical knowledge sources (e.g., Radiopaedia), and dynamically integrate that evidence during answer generation. This work is the first to deeply integrate agent architectures with iterative RAG, uncovering a complementary synergy between retrieval augmentation and supervised fine-tuning. Experiments demonstrate substantial improvements: mean diagnostic accuracy increases from 64% to 73% (+9 percentage points), with smaller models (e.g., Qwen 2.5-7B) gaining up to 16 percentage points; hallucinations are reduced (mean rate 9.4%); and clinically relevant context is successfully retrieved in 46% of cases, significantly enhancing factual consistency and reasoning robustness.
Abstract
Clinical decision-making in radiology increasingly benefits from artificial intelligence (AI), particularly through large language models (LLMs). However, traditional retrieval-augmented generation (RAG) systems for radiology question answering (QA) typically rely on single-step retrieval, limiting their ability to handle complex clinical reasoning tasks. Here we propose an agentic RAG framework enabling LLMs to autonomously decompose radiology questions, iteratively retrieve targeted clinical evidence from Radiopaedia, and dynamically synthesize evidence-based responses. We evaluated 24 LLMs spanning diverse architectures, parameter scales (0.5B to >670B), and training paradigms (general-purpose, reasoning-optimized, clinically fine-tuned), using 104 expert-curated radiology questions from previously established RSNA-RadioQA and ExtendedQA datasets. Agentic retrieval significantly improved mean diagnostic accuracy over zero-shot prompting (73% vs. 64%; P<0.001) and conventional online RAG (73% vs. 68%; P<0.001). The greatest gains occurred in mid-sized models (e.g., Mistral Large improved from 72% to 81%) and small-scale models (e.g., Qwen 2.5-7B improved from 55% to 71%), while very large models (>200B parameters) demonstrated minimal changes (<2% improvement). Additionally, agentic retrieval reduced hallucinations (mean 9.4%) and retrieved clinically relevant context in 46% of cases, substantially aiding factual grounding. Even clinically fine-tuned models exhibited meaningful improvements (e.g., MedGemma-27B improved from 71% to 81%), indicating complementary roles of retrieval and fine-tuning. These results highlight the potential of agentic frameworks to enhance factuality and diagnostic accuracy in radiology QA, particularly among mid-sized LLMs, warranting future studies to validate their clinical utility.
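The agentic loop described above (decompose the question, retrieve evidence per sub-query, then synthesize) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names, the keyword-matching retriever, and the tiny in-memory knowledge base are all assumptions standing in for an LLM planner and a Radiopaedia-backed search index.

```python
def decompose(question):
    """Split a complex question into focused sub-queries (stubbed).

    A real agent would prompt the LLM to plan these sub-queries.
    """
    return [part.strip() for part in question.split(" and ")]


# Stand-in for a Radiopaedia-style search index (illustrative only).
KNOWLEDGE = {
    "pneumothorax signs": "Deep sulcus sign may be seen on supine radiograph.",
    "management": "Small asymptomatic pneumothorax: observation is an option.",
}


def retrieve(query):
    """Return the best-matching snippet, or None (stubbed keyword lookup)."""
    for key, snippet in KNOWLEDGE.items():
        if key in query.lower():
            return snippet
    return None


def agentic_rag(question, max_steps=4):
    """Iteratively gather evidence for each sub-query, then hand off for synthesis."""
    evidence = []
    for sub_query in decompose(question)[:max_steps]:
        snippet = retrieve(sub_query)
        if snippet is not None:  # keep only grounded evidence
            evidence.append(snippet)
    # A real system would pass `evidence` back to the LLM to generate the answer.
    return {"question": question, "evidence": evidence}


result = agentic_rag("pneumothorax signs and management")
```

The key design point the sketch reflects is that retrieval happens per sub-query inside the loop, rather than once up front as in conventional single-step RAG, so later steps can target evidence the initial query would have missed.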