Agentic retrieval-augmented reasoning reshapes collective reliability under model variability in radiology question answering

📅 2026-03-06
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses inconsistent decision-making among large language models (LLMs) in radiology question answering, where architectural heterogeneity leads to divergent reasoning that accuracy alone cannot adequately capture. To improve reliability, the authors propose a radiology-specific, multi-step agentic retrieval-augmented generation (Agentic RAG) framework that draws on a structured knowledge base to standardize evidence inputs and guide heterogeneous models toward consistent reasoning. In the first systematic evaluation of its kind, across 169 radiology questions, the approach significantly reduced inter-model decision entropy (from 0.48 to 0.13), improved robustness of correctness (from 0.74 to 0.81), and strengthened majority consensus (P<0.001). These results reveal dimensions of stability and clinical relevance that conventional accuracy metrics miss, although high consensus does not guarantee correctness.

📝 Abstract
Agentic retrieval-augmented reasoning pipelines are increasingly used to structure how large language models (LLMs) incorporate external evidence in clinical decision support. These systems iteratively retrieve curated domain knowledge and synthesize it into structured reports before answer selection. Although such pipelines can improve performance, their impact on reliability under model variability remains unclear. In real-world deployment, heterogeneous models may align, diverge, or synchronize errors in ways not captured by accuracy. We evaluated 34 LLMs on 169 expert-curated publicly available radiology questions, comparing zero-shot inference with a radiology-specific multi-step agentic retrieval condition in which all models received identical structured evidence reports derived from curated radiology knowledge. Agentic inference reduced inter-model decision dispersion (median entropy 0.48 vs. 0.13) and increased robustness of correctness across models (mean 0.74 vs. 0.81). Majority consensus also increased overall (P<0.001). Consensus strength and robust correctness remained correlated under both strategies (ρ=0.88 for zero-shot; ρ=0.87 for agentic), although high agreement did not guarantee correctness. Response verbosity showed no meaningful association with correctness. Among 572 incorrect outputs, 72% were associated with moderate or high clinically assessed severity, although inter-rater agreement was low (κ=0.02). Agentic retrieval therefore was associated with more concentrated decision distributions, stronger consensus, and higher cross-model robustness of correctness. These findings suggest that evaluating agentic systems through accuracy or agreement alone may not always be sufficient, and that complementary analyses of stability, cross-model robustness, and potential clinical impact are needed to characterize reliability under model variability.
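The abstract does not give the exact formula behind "inter-model decision dispersion", but the reported values are consistent with Shannon entropy over the distribution of answers the model pool produces for a question. A minimal sketch, assuming base-2 Shannon entropy and a hypothetical pool of 34 models answering one multiple-choice item:

```python
import math
from collections import Counter

def decision_entropy(answers):
    """Shannon entropy (base 2) of the answer distribution across models.

    0.0 means every model chose the same option; larger values mean
    the pool's decisions are more dispersed.
    """
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical answer pools for one question from 34 models.
concentrated = ["B"] * 32 + ["C"] * 2             # strong consensus
dispersed = ["A"] * 12 + ["B"] * 12 + ["C"] * 10  # split three ways

print(decision_entropy(concentrated))  # low: near-unanimous pool
print(decision_entropy(dispersed))     # high: dispersed pool
```

Under this reading, the shift in median entropy from 0.48 (zero-shot) to 0.13 (agentic) corresponds to answer distributions becoming markedly more concentrated around a single option.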
Problem

Research questions and friction points this paper is trying to address.

agentic retrieval-augmented reasoning
model variability
radiology question answering
collective reliability
cross-model robustness
Mina Farajiamiri
Lab for AI in Medicine, RWTH Aachen University, Aachen, Germany
Jeta Sopa
Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
Saba Afza
Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
Lisa Adams
Assistant Professor of Radiology | Technical University Munich
Radiology, AI, Molecular MRI
Felix Barajas Ordonez
Lab for AI in Medicine, RWTH Aachen University, Aachen, Germany; Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen, Aachen, Germany
Tri-Thien Nguyen
Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany; Institute of Radiology, University Hospital Erlangen, Erlangen, Germany
Mahshad Lotfinia
RWTH Aachen University
Artificial Intelligence, Deep Learning, Medical Image Analysis
Sebastian Wind
Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany; Erlangen National High Performance Computing Center, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
Keno Bressem
Technical University Munich
deep learning, radiomics, microwave ablation
Sven Nebelung
Department of Diagnostic and Interventional Radiology, University Hospital Aachen
Advanced MRI Techniques, Functionality Assessment, Biomechanical Imaging, Cartilage, Artificial Intelligence
Daniel Truhn
Professor of Radiology, University Hospital Aachen
Machine Learning, Artificial Intelligence, Computer Vision, Medical Imaging
Soroosh Tayebi Arasteh
RWTH Aachen University
Deep Learning, AI in Medicine, Generative AI, Medical Image Analysis