🤖 AI Summary
Electronic health record (EHR) texts are long and noisy, frequently exceeding the context windows of mainstream large language models (LLMs) and thereby impeding clinical reasoning. To address this, we systematically evaluate retrieval-augmented generation (RAG) against full-context input across clinical tasks. We introduce three reproducible, multi-institutional clinical tasks (key information extraction, temporal event modeling, and core diagnosis identification) and run experiments with three state-of-the-art LLMs, comparing targeted text retrieval against the most recent clinical notes as input. Results show that RAG matches or exceeds full-context baselines on most tasks, improving average F1 by 2.3–5.1 percentage points, while consuming only ~15% of the input tokens. This substantially reduces computational cost and improves practical deployability. To our knowledge, this is the first study to empirically validate, within a unified framework, RAG's efficiency and robustness for long-context EHR reasoning, establishing a lightweight, scalable paradigm for clinical LLM deployment.
📝 Abstract
Electronic health records (EHRs) are long, noisy, and often redundant, posing a major challenge for the clinicians who must navigate them. Large language models (LLMs) offer a promising way to extract and reason over this unstructured text, but the length of clinical notes often exceeds even the extended context windows of state-of-the-art models. Retrieval-augmented generation (RAG) offers an alternative: retrieving task-relevant passages from across the entire EHR, potentially reducing the number of input tokens required. In this work, we propose three clinical tasks designed to be replicable across health systems with minimal effort: 1) extracting imaging procedures, 2) generating timelines of antibiotic use, and 3) identifying key diagnoses. Using EHRs from actual hospitalized patients, we test three state-of-the-art LLMs with varying amounts of provided context, drawn either from targeted text retrieval or from the most recent clinical notes. We find that RAG closely matches or exceeds the performance of using recent notes, and approaches full-context performance while requiring drastically fewer input tokens. Our results suggest that RAG remains a competitive and efficient approach even as newer models become capable of handling increasingly long inputs.
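To make the retrieval idea concrete: instead of sending the full record (or just the most recent notes) to the model, a RAG pipeline scores note passages against a task query and forwards only the top few. The paper does not specify its retriever, so the following is a minimal sketch, not the authors' pipeline, using stdlib-only TF-IDF-style keyword scoring; the example notes and the query string are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split text into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def rank_passages(passages: list[str], query: str, k: int = 2) -> list[str]:
    """Return the k passages most similar to the query under TF-IDF scoring."""
    docs = [Counter(tokenize(p)) for p in passages]
    n = len(docs)
    # Inverse document frequency computed over this passage collection:
    # terms appearing in every passage get weight log(1) = 0.
    idf = {t: math.log(n / sum(1 for d in docs if t in d))
           for doc in docs for t in doc}

    def score(doc: Counter) -> float:
        return sum(doc[t] * idf.get(t, 0.0) for t in tokenize(query))

    ranked = sorted(range(n), key=lambda i: score(docs[i]), reverse=True)
    return [passages[i] for i in ranked[:k]]


# Toy EHR passages (invented); the antibiotic-timeline task would query for
# antibiotic mentions and send only the matching passages to the LLM.
notes = [
    "Chest X-ray performed on admission; no acute findings.",
    "Started vancomycin for suspected MRSA bacteremia.",
    "Patient ambulating in hallway, tolerating diet.",
    "Vancomycin discontinued, switched to oral cefalexin on day 5.",
]
top = rank_passages(notes, "antibiotic vancomycin timeline", k=2)
# Both retrieved passages concern vancomycin; the other two notes are dropped,
# which is how RAG cuts input tokens relative to full-context prompting.
```

A production retriever would typically use dense embeddings rather than keyword overlap, but the token-saving mechanism is the same: the LLM sees only the retrieved passages, not the entire record.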