🤖 AI Summary
Existing studies lack rigorous analysis of retrieval-augmented generation (RAG) in long-context (LC) clinical question answering, particularly regarding its role in medical reasoning over lengthy documents.
Method: This work conducts the first empirical evaluation of RAG’s effectiveness in single- and multi-document medical QA, systematically varying model scale (7B–70B), relevance configurations, and task formats across multiple expert-curated clinical datasets. Evaluation is multidimensional, covering factual consistency, memory retention, reasoning path analysis, and fine-grained error attribution.
Contribution/Results: RAG substantially improves factual consistency in LC clinical QA but exhibits failure modes, including document redundancy and relevance mismatch, that degrade performance. Crucially, smaller models with precise retrieval outperform larger models in zero-shot settings. Based on these findings, we propose a clinical-domain-specific RAG adaptation strategy and a performance-optimization framework that jointly address retrieval precision, context integration, and model-scale alignment.
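The "precise RAG" recipe above, retrieving only highly relevant passages and integrating them into the reader's context, can be sketched in a toy retrieve-then-read pipeline. This is a minimal illustration under stated assumptions: the lexical-overlap retriever, the `[Doc i]` prompt format, and all function names are illustrative, not the paper's actual implementation.

```python
# Toy retrieve-then-read pipeline for long-document clinical QA.
# Illustrative only: a real system would use a learned retriever and
# an LLM reader; here we show the retrieval-precision and
# context-integration steps with a simple lexical scorer.
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def score(question: str, passage: str) -> float:
    """Length-normalized token overlap between question and passage."""
    q = Counter(tokenize(question))
    p = Counter(tokenize(passage))
    overlap = sum((q & p).values())  # multiset intersection
    return overlap / (1 + len(tokenize(passage)))


def retrieve(question: str, passages: list[str], k: int = 2) -> list[str]:
    """Retrieval precision: keep only the k most relevant passages."""
    ranked = sorted(passages, key=lambda p: score(question, p), reverse=True)
    return ranked[:k]


def build_prompt(question: str, passages: list[str]) -> str:
    """Context integration: concatenate retrieved evidence, then ask."""
    context = "\n\n".join(f"[Doc {i + 1}] {p}" for i, p in enumerate(passages))
    return f"{context}\n\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    docs = [
        "Patient history: hypertension managed with lisinopril.",
        "Warfarin dose was adjusted after an elevated INR reading.",
        "Administrative note: appointment rescheduled to Tuesday.",
    ]
    q = "Why was the warfarin dose adjusted?"
    print(build_prompt(q, retrieve(q, docs, k=1)))
```

With `k=1`, only the warfarin passage reaches the reader, which is the intuition behind why a small model with precise retrieval can beat a larger model given the full, redundant context.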
📝 Abstract
This study is the first to investigate LLM comprehension capabilities on long-context (LC) medical QA of clinical relevance. Our comprehensive assessment spans content-inclusion settings of varying relevance, LLMs of differing capability, and datasets across task formulations, revealing insights on model-size effects, limitations, underlying memorization issues, and the benefits of reasoning models. Importantly, we examine the effect of RAG on medical LC comprehension, uncover the best settings for single- versus multi-document reasoning datasets, and showcase RAG strategies that improve over LC alone. We shed light on key evaluation aspects using a multi-faceted approach. Our qualitative and error analyses address open questions on when RAG is beneficial over LC, revealing common failure cases.