Context Clues: Evaluating Long Context Models for Clinical Prediction Tasks on EHRs

📅 2024-12-09
🏛️ arXiv.org
📈 Citations: 2
Influential: 1
🤖 AI Summary
Existing EHR foundation models are constrained by short context windows (<1k tokens), limiting their ability to model full clinical trajectories that can span 10k+ events. This work presents the first systematic evaluation of the effect of context length on EHR modeling, applying subquadratic long-context architectures such as Mamba to EHR data. Across the 14 clinical prediction tasks of the EHRSHOT benchmark, the Mamba-based model surpasses the prior state of the art on 9 tasks. The authors also stratify results by three previously underexplored EHR-specific properties (copy-forwarded diagnoses that artificially repeat tokens, irregular time intervals between events, and naturally increasing disease complexity over time), finding that higher levels of each property correlate with worse performance, but that longer-context models are more robust at extreme levels of these properties. The study provides empirical support for deploying long-context models in real-world EHR applications.

📝 Abstract
Foundation Models (FMs) trained on Electronic Health Records (EHRs) have achieved state-of-the-art results on numerous clinical prediction tasks. However, most existing EHR FMs have context windows of <1k tokens. This prevents them from modeling full patient EHRs which can exceed 10k's of events. Recent advancements in subquadratic long-context architectures (e.g., Mamba) offer a promising solution. However, their application to EHR data has not been well-studied. We address this gap by presenting the first systematic evaluation of the effect of context length on modeling EHR data. We find that longer context models improve predictive performance -- our Mamba-based model surpasses the prior state-of-the-art on 9/14 tasks on the EHRSHOT prediction benchmark. For clinical applications, however, model performance alone is insufficient -- robustness to the unique properties of EHR is crucial. Thus, we also evaluate models across three previously underexplored properties of EHR data: (1) the prevalence of "copy-forwarded" diagnoses which creates artificial repetition of tokens within EHR sequences; (2) the irregular time intervals between EHR events which can lead to a wide range of timespans within a context window; and (3) the natural increase in disease complexity over time which makes later tokens in the EHR harder to predict than earlier ones. Stratifying our EHRSHOT results, we find that higher levels of each property correlate negatively with model performance, but that longer context models are more robust to more extreme levels of these properties. Our work highlights the potential for using long-context architectures to model EHR data, and offers a case study for identifying new challenges in modeling sequential data motivated by domains outside of natural language. We release our models and code at: https://github.com/som-shahlab/long_context_clues
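The first two EHR properties the abstract describes can be quantified per patient sequence. A minimal sketch of such metrics (the `Event` fields and function names are illustrative assumptions, not taken from the paper's released code):

```python
from dataclasses import dataclass

@dataclass
class Event:
    code: str         # e.g. an ICD-10 diagnosis code
    timestamp: float  # days since the patient's first recorded visit

def repetition_rate(events):
    """Fraction of events whose code already appeared earlier in the
    sequence -- a crude proxy for copy-forwarded diagnoses."""
    seen, repeats = set(), 0
    for e in events:
        if e.code in seen:
            repeats += 1
        seen.add(e.code)
    return repeats / len(events) if events else 0.0

def context_timespan(events):
    """Days spanned by the events in one context window; irregular
    inter-event intervals make this vary widely across patients."""
    if not events:
        return 0.0
    ts = [e.timestamp for e in events]
    return max(ts) - min(ts)

events = [
    Event("E11.9", 0.0),    # type 2 diabetes, first visit
    Event("E11.9", 30.0),   # copy-forwarded a month later
    Event("I10", 30.0),     # hypertension
    Event("E11.9", 400.0),  # copy-forwarded again, much later
]
print(repetition_rate(events))   # 0.5 (two of four codes repeat earlier ones)
print(context_timespan(events))  # 400.0 days
```

Stratifying benchmark results by bucketed values of metrics like these is the kind of robustness analysis the abstract describes.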
Problem

Research questions and friction points this paper is trying to address.

Evaluates long-context models for clinical prediction on EHRs.
Addresses limitations of existing EHR models with short context windows.
Explores robustness to unique EHR properties like token repetition and irregular time intervals.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes Mamba-based long-context architectures for EHRs
Evaluates impact of context length on EHR predictive performance
Assesses robustness to EHR-specific data properties
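The reason state-space architectures like Mamba can scale to 10k+-event contexts is that they process a sequence with a linear recurrence (O(n) time, constant-size state) rather than quadratic self-attention. A toy scalar recurrence illustrates the shape of the computation; real Mamba layers use vector states with input-dependent, selective parameters, so this is only a sketch:

```python
def linear_scan(xs, a=0.9, b=0.1):
    """Toy linear state-space recurrence h_t = a*h_{t-1} + b*x_t.
    One pass over the sequence: O(n) time and O(1) state, versus the
    O(n^2) pairwise interactions of self-attention."""
    h, hs = 0.0, []
    for x in xs:
        h = a * h + b * x
        hs.append(h)
    return hs

# An impulse at t=0 decays geometrically through the state:
print(linear_scan([1.0, 0.0, 0.0]))  # approximately [0.1, 0.09, 0.081]
```

The constant-size state is what keeps memory flat as the context window grows, which is the practical enabler for the long-context experiments above.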