MIMIC-IV-Ext-22MCTS: A 22 Millions-Event Temporal Clinical Time-Series Dataset with Relative Timestamp for Risk Prediction

📅 2025-05-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Temporal information about clinical events is frequently absent from unstructured discharge summaries, and modeling long clinical texts remains challenging. Method: We propose a chunk-filtering framework that combines contextual BM25 retrieval with contextual semantic search, coupled with implicit timestamp inference using Llama-3.1-8B, enabling the first fully automated extraction of clinical events annotated with relative timestamps from MIMIC-IV discharge summaries. Contribution/Results: We construct a high-quality temporal clinical dataset comprising 22,588,586 events to support downstream risk prediction modeling. A BERT model fine-tuned on the dataset gains 10% in accuracy on medical question answering and 3% on clinical trial matching compared with the classic BERT; a fine-tuned GPT-2 produces more clinically reliable answers to clinical questions. This work establishes a scalable, methodology-driven paradigm for temporal clinical NLP data curation and model development.

📝 Abstract

Clinical risk prediction based on machine learning algorithms plays a vital role in modern healthcare. A crucial component in developing a reliable prediction model is collecting high-quality time-series clinical events. In this work, we release such a dataset, consisting of 22,588,586 clinical time-series events, which we term MIMIC-IV-Ext-22MCTS. Our source data are discharge summaries selected from the well-known yet unstructured MIMIC-IV-Note (Johnson et al., 2023). We extract clinical events as short text spans from the discharge summaries, along with the timestamps of these events as temporal information. The general-purpose MIMIC-IV-Note poses specific challenges for our work: the discharge summaries are too long for typical natural language models to process, and the clinical events of interest are often not accompanied by explicit timestamps. Therefore, we propose a new framework that works as follows: 1) we break each discharge summary into manageably small text chunks; 2) we apply contextual BM25 and contextual semantic search to retrieve chunks that have a high potential of containing clinical events; and 3) we carefully design prompts to teach the recently released Llama-3.1-8B model (Touvron et al., 2023) to identify or infer the temporal information of the chunks. We show that the resulting dataset is informative and transparent enough that standard models fine-tuned on it achieve significant improvements in healthcare applications. In particular, a BERT model fine-tuned on our dataset achieves a 10% improvement in accuracy on a medical question answering task and a 3% improvement on a clinical trial matching task compared with the classic BERT. A GPT-2 model fine-tuned on our dataset produces more clinically reliable results for clinical questions.
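Steps 1 and 2 of the framework can be illustrated with a minimal sketch. It splits a summary into fixed-size word chunks and ranks them against a query using plain Okapi BM25. The chunk size, the parameters `k1` and `b`, the tokenizer, and the function names are assumptions made for illustration; the contextual enrichment of chunks and the semantic-search component described in the abstract are not reproduced here.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=40):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, chunks, k1=1.5, b=0.75):
    """Score every chunk against the query with Okapi BM25."""
    docs = [tokenize(c) for c in chunks]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n if n else 0.0
    # Document frequency: number of chunks that contain each term.
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_chunks(query, chunks, k=2):
    """Return the k highest-scoring chunks, best first."""
    scores = bm25_scores(query, chunks)
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return [chunks[i] for i in ranked[:k]]
```

Calling `top_chunks("chest pain", chunk_text(summary))` on a summary string would keep only the chunks that mention the query terms most prominently. In a fuller pipeline these lexical scores would presumably be combined with embedding-based similarity before any chunk is passed to the language model.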

Problem

Research questions and friction points this paper is trying to address.

Extracting clinical events from lengthy discharge summaries

Inferring timestamps for clinical events without explicit temporal data

Improving risk prediction models with high-quality time-series data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Chunking discharge summaries for NLP processing

Using BM25 and semantic search for event retrieval

Prompting Llama-3.1-8B to infer temporal information
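The prompting step can be pictured with a small, hypothetical sketch. The prompt wording, the `event | hours` output format, and the choice of hours relative to admission as the time unit are all assumptions for illustration; the source does not give the actual prompts. No model is called here: the sketch only builds a prompt for one chunk and parses a response of the assumed shape.

```python
import re

# Hypothetical prompt template; the real prompts are not given in the source.
PROMPT_TEMPLATE = (
    "You are given a chunk of a discharge summary.\n"
    "List each clinical event with its time relative to admission, in hours.\n"
    "Use one line per event in the form: event | hours\n\n"
    "Chunk:\n{chunk}\n"
)


def build_prompt(chunk):
    """Fill the template with one retrieved chunk."""
    return PROMPT_TEMPLATE.format(chunk=chunk)


def parse_events(model_output):
    """Parse 'event | hours' lines into (event, hours) tuples, skipping malformed lines."""
    events = []
    for line in model_output.splitlines():
        match = re.fullmatch(r"\s*(.+?)\s*\|\s*(-?\d+(?:\.\d+)?)\s*", line)
        if match:
            events.append((match.group(1), float(match.group(2))))
    return events
```

Negative values would denote events that happened before admission. Skipping malformed lines, rather than raising an error, is one way to keep a large extraction run going when the model occasionally strays from the requested format.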

🔎 Similar Papers

EMERGE: Enhancing Multimodal Electronic Health Records Predictive Modeling with Retrieval-Augmented Generation