CARE-RAG: Clinical Assessment and Reasoning in RAG

📅 2025-11-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper identifies a “retrieval-reasoning disconnection” problem in large language models (LLMs) deployed in clinical settings: even when authoritative clinical guidelines—such as those for Written Exposure Therapy (WET)—are accurately retrieved, LLMs frequently generate reasoning outputs that violate structured protocol requirements. To address this, we introduce the first RAG evaluation framework specifically designed for clinical protocol adherence, quantifying reasoning quality along three dimensions: accuracy, consistency, and fidelity. Leveraging an expert-validated WET question set and the corresponding guideline documents, we conduct a systematic empirical analysis across leading LLMs. Results show that while current RAG systems constrain output format, critical clinical reasoning errors persist in 32% of responses. Our key contribution is the formal integration of reasoning processes—alongside retrieval—into rigorous evaluation. We further provide a reproducible benchmark and actionable improvement pathways for structured medical decision-making.

📝 Abstract
Access to the right evidence does not guarantee that large language models (LLMs) will reason with it correctly. This gap between retrieval and reasoning is especially concerning in clinical settings, where outputs must align with structured protocols. We study this gap using Written Exposure Therapy (WET) guidelines as a testbed. In evaluating model responses to curated clinician-vetted questions, we find that errors persist even when authoritative passages are provided. To address this, we propose an evaluation framework that measures accuracy, consistency, and fidelity of reasoning. Our results highlight both the potential and the risks: retrieval-augmented generation (RAG) can constrain outputs, but safe deployment requires assessing reasoning as rigorously as retrieval.
Problem

Research questions and friction points this paper is trying to address.

Bridging the gap between evidence retrieval and clinical reasoning in LLMs
Ensuring model outputs align with structured clinical therapy protocols
Assessing reasoning accuracy and consistency alongside retrieval quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

CARE-RAG framework evaluates clinical reasoning accuracy
Measures consistency and fidelity in therapy guideline adherence
Assesses reasoning rigor alongside retrieval in clinical RAG
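The three evaluation dimensions above could be operationalized roughly as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the function names, the keyword-matching scoring rules, and the input formats are all assumptions made for clarity.

```python
# Hypothetical sketch of CARE-RAG-style scoring along the paper's three
# dimensions (accuracy, consistency, fidelity). Scoring by substring
# matching is a stand-in for whatever rubric the paper actually uses.

def accuracy(answer: str, gold_keys: list[str]) -> float:
    """Fraction of expert-annotated key points the answer covers."""
    hits = sum(1 for k in gold_keys if k.lower() in answer.lower())
    return hits / len(gold_keys) if gold_keys else 0.0

def consistency(answers: list[str], gold_keys: list[str]) -> float:
    """Stability of accuracy across repeated generations of one question.

    Returns 1 - mean absolute deviation of the per-answer accuracy
    scores; 1.0 means every generation scored identically.
    """
    scores = [accuracy(a, gold_keys) for a in answers]
    mean = sum(scores) / len(scores)
    return 1.0 - sum(abs(s - mean) for s in scores) / len(scores)

def fidelity(answer: str, protocol_steps: list[str]) -> float:
    """Fraction of mandated protocol steps the answer adheres to."""
    kept = sum(1 for s in protocol_steps if s.lower() in answer.lower())
    return kept / len(protocol_steps) if protocol_steps else 0.0
```

For example, an answer that mentions every expert key point scores 1.0 on accuracy, while repeated generations that cover different subsets of key points drag consistency below 1.0.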
Deepthi Potluri
Department of Computer Science, University of Texas at Austin
Aby Mammen Mathew
Department of Computer Science, University of Texas at Austin
Alexander L. Rasgon
Behavioral Science and Psychiatry, University of Texas at Austin
Jeffrey B. DeWitt
Department of Computer Science, University of Texas at Austin
Yide Hao
Department of Statistics, University of Michigan
Junyuan Hong
School of Information, University of Texas at Austin
Ying Ding
Bill & Lewis Suit Professor, School of Information, Dell Med, University of Texas at Austin
AI in Health · Knowledge Graph · Science of Science