Revisiting the MIMIC-IV Benchmark: Experiments Using Language Models for Electronic Health Records

📅 2025-04-29
🏛️ CL4HEALTH
📈 Citations: 2
Influential: 0
🤖 AI Summary
The absence of standardized, text-based electronic health record (EHR) evaluation benchmarks impedes fair comparison and deployment of large language models (LLMs) in clinical downstream tasks. To address this, we introduce the first open-source, Hugging Face–native textual EHR benchmark—systematically transforming MIMIC-IV’s structured clinical data into natural-language sequences via template engineering. The benchmark supports zero-shot prompting, supervised fine-tuning (e.g., Llama-3, Phi-3), and comparison with traditional models. Our key contributions include the first standardized, open-access textual reconstruction of MIMIC-IV and its integration into the Hugging Face ecosystem. Empirical results show that fine-tuned textual models achieve an AUC of 0.86 on in-hospital mortality prediction—matching state-of-the-art tabular models (XGBoost, logistic regression)—thereby validating the viability of the textual pathway. In contrast, zero-shot LLMs underperform markedly (AUC < 0.6), underscoring the critical impact of domain-specific adaptation and data representation on LLM efficacy in healthcare.
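The template-engineering step described above can be sketched in a few lines: a textual template renders one structured EHR row as a natural-language sequence that an LLM can consume. This is a minimal illustration; the field names and template wording below are hypothetical and do not reproduce the paper's actual templates or MIMIC-IV's schema.

```python
# Hedged sketch: turn one structured EHR record into a text sequence
# via a fixed template. Field names are illustrative, not MIMIC-IV's.

def row_to_text(row: dict) -> str:
    """Render a structured EHR record as a natural-language sequence."""
    return (
        f"The patient is a {row['age']}-year-old {row['gender']} "
        f"admitted for {row['admission_type']}. "
        f"Heart rate: {row['heart_rate']} bpm; "
        f"blood pressure: {row['sbp']}/{row['dbp']} mmHg."
    )

record = {
    "age": 67, "gender": "male", "admission_type": "emergency",
    "heart_rate": 92, "sbp": 128, "dbp": 74,
}
print(row_to_text(record))
```

Sequences produced this way can then be fed to a fine-tuned or zero-shot LLM in place of the raw tabular features.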

📝 Abstract
The lack of standardized evaluation benchmarks in the medical domain for text inputs can be a barrier to widely adopting and leveraging the potential of natural language models for health-related downstream tasks. This paper revisits an openly available MIMIC-IV benchmark for electronic health records (EHRs) to address this issue. First, we integrate the MIMIC-IV data within the Hugging Face datasets library to allow easy sharing and use of this collection. Second, we investigate the application of templates to convert EHR tabular data to text. Experiments using fine-tuned and zero-shot LLMs on the patient mortality prediction task show that fine-tuned text-based models are competitive against robust tabular classifiers. In contrast, zero-shot LLMs struggle to leverage EHR representations. This study underlines the potential of text-based approaches in the medical field and highlights areas for further improvement.
Problem

Research questions and friction points this paper is trying to address.

Lack of standardized medical text evaluation benchmarks
Converting EHR tabular data to text using templates
Assessing fine-tuned vs zero-shot LLMs on patient mortality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrate MIMIC-IV data with Hugging Face
Convert EHR tabular data to text
Fine-tuned text models competitive with robust tabular classifiers
Jesús Lovón-Melgarejo
University of Toulouse, IRIT, 31000 Toulouse, France
Thouria Ben-Haddi
University of Toulouse, IRIT, 31000 Toulouse, France
Jules Di Scala
University of Toulouse, IRIT, 31000 Toulouse, France
José G. Moreno
University of Toulouse, IRIT, 31000 Toulouse, France
Lynda Tamine
Professor in computer science, University of Toulouse, IRIT, France
Information retrieval