🤖 AI Summary
Existing studies lack rigorous analysis of retrieval-augmented generation (RAG) in long-context (LC) clinical question answering, particularly regarding its role in medical reasoning over lengthy documents.
Method: This work conducts the first empirical evaluation of RAG’s effectiveness in single- and multi-document medical QA, systematically varying model scale (7B–70B), relevance configurations, and task formats across multiple expert-curated clinical datasets. Evaluation is multidimensional, covering factual consistency, memory retention, reasoning path analysis, and fine-grained error attribution.
Contribution/Results: RAG substantially improves factual consistency in LC clinical QA but exhibits failure modes, including document redundancy and relevance mismatch, that degrade performance. Crucially, smaller models with precise retrieval outperform larger models in zero-shot settings. Based on these findings, we propose a clinical-domain-specific RAG adaptation strategy and a performance-optimization framework that jointly address retrieval precision, context integration, and model-scale alignment.
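The "precise RAG" recipe above, retrieving only highly relevant passages and integrating them into the reader's context, can be sketched in a toy retrieve-then-read pipeline. This is a minimal illustration under stated assumptions: the lexical-overlap retriever, the `[Doc i]` prompt format, and all function names are illustrative, not the paper's actual implementation.

```python
# Toy retrieve-then-read pipeline for long-document clinical QA.
# Illustrative only: a real system would use a learned retriever and
# an LLM reader; here we show the retrieval-precision and
# context-integration steps with a simple lexical scorer.
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def score(question: str, passage: str) -> float:
    """Length-normalized token overlap between question and passage."""
    q = Counter(tokenize(question))
    p = Counter(tokenize(passage))
    overlap = sum((q & p).values())  # multiset intersection
    return overlap / (1 + len(tokenize(passage)))


def retrieve(question: str, passages: list[str], k: int = 2) -> list[str]:
    """Retrieval precision: keep only the k most relevant passages."""
    ranked = sorted(passages, key=lambda p: score(question, p), reverse=True)
    return ranked[:k]


def build_prompt(question: str, passages: list[str]) -> str:
    """Context integration: concatenate retrieved evidence, then ask."""
    context = "\n\n".join(f"[Doc {i + 1}] {p}" for i, p in enumerate(passages))
    return f"{context}\n\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    docs = [
        "Patient history: hypertension managed with lisinopril.",
        "Warfarin dose was adjusted after an elevated INR reading.",
        "Administrative note: appointment rescheduled to Tuesday.",
    ]
    q = "Why was the warfarin dose adjusted?"
    print(build_prompt(q, retrieve(q, docs, k=1)))
```

With `k=1`, only the warfarin passage reaches the reader, which is the intuition behind why a small model with precise retrieval can beat a larger model given the full, redundant context.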
📝 Abstract
This study is the first to investigate LLM comprehension capabilities on long-context (LC) medical QA of clinical relevance. Our comprehensive assessment spans content-inclusion settings of varying relevance, LLMs of differing capability, and datasets across task formulations, revealing insights on model-size effects, limitations, underlying memorization issues, and the benefits of reasoning models. Importantly, we examine the effect of RAG on medical LC comprehension, uncover the best settings for single- versus multi-document reasoning datasets, and showcase RAG strategies that improve over LC alone. We shed light on key evaluation aspects using a multi-faceted approach. Our qualitative and error analyses address open questions on when RAG is beneficial over LC, revealing common failure cases.