Revisiting the MIMIC-IV Benchmark: Experiments Using Language Models for Electronic Health Records

📅 2025-04-29
🏛️ CL4HEALTH
📈 Citations: 2
Influential: 0
🤖 AI Summary
The absence of standardized, text-based electronic health record (EHR) evaluation benchmarks impedes fair comparison and deployment of large language models (LLMs) in clinical downstream tasks. To address this, we introduce the first open-source, Hugging Face–native textual EHR benchmark—systematically transforming MIMIC-IV’s structured clinical data into natural-language sequences via template engineering. The benchmark supports zero-shot prompting, supervised fine-tuning (e.g., Llama-3, Phi-3), and comparison with traditional models. Our key contributions include the first standardized, open-access textual reconstruction of MIMIC-IV and its integration into the Hugging Face ecosystem. Empirical results show that fine-tuned textual models achieve an AUC of 0.86 on in-hospital mortality prediction—matching state-of-the-art tabular models (XGBoost, logistic regression)—thereby validating the viability of the textual pathway. In contrast, zero-shot LLMs underperform markedly (AUC < 0.6), underscoring the critical impact of domain-specific adaptation and data representation on LLM efficacy in healthcare.
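The template-engineering step described above can be sketched in a few lines: a textual template renders one structured EHR row as a natural-language sequence that an LLM can consume. This is a minimal illustration; the field names and template wording below are hypothetical and do not reproduce the paper's actual templates or MIMIC-IV's schema.

```python
# Hedged sketch: turn one structured EHR record into a text sequence
# via a fixed template. Field names are illustrative, not MIMIC-IV's.

def row_to_text(row: dict) -> str:
    """Render a structured EHR record as a natural-language sequence."""
    return (
        f"The patient is a {row['age']}-year-old {row['gender']} "
        f"admitted for {row['admission_type']}. "
        f"Heart rate: {row['heart_rate']} bpm; "
        f"blood pressure: {row['sbp']}/{row['dbp']} mmHg."
    )

record = {
    "age": 67, "gender": "male", "admission_type": "emergency",
    "heart_rate": 92, "sbp": 128, "dbp": 74,
}
print(row_to_text(record))
```

Sequences produced this way can then be fed to a fine-tuned or zero-shot LLM in place of the raw tabular features.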

📝 Abstract
The lack of standardized evaluation benchmarks in the medical domain for text inputs can be a barrier to widely adopting and leveraging the potential of natural language models for health-related downstream tasks. This paper revisits an openly available MIMIC-IV benchmark for electronic health records (EHRs) to address this issue. First, we integrate the MIMIC-IV data within the Hugging Face datasets library to allow easy sharing and use of this collection. Second, we investigate the application of templates to convert EHR tabular data to text. Experiments using fine-tuned and zero-shot LLMs on the patient mortality prediction task show that fine-tuned text-based models are competitive against robust tabular classifiers. In contrast, zero-shot LLMs struggle to leverage EHR representations. This study underlines the potential of text-based approaches in the medical field and highlights areas for further improvement.
Problem

Research questions and friction points this paper is trying to address.

Lack of standardized medical text evaluation benchmarks
Converting EHR tabular data to text using templates
Assessing fine-tuned vs zero-shot LLMs on patient mortality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrate MIMIC-IV data with Hugging Face
Convert EHR tabular data to text
Fine-tuned text models competitive with robust tabular classifiers
Jesús Lovón-Melgarejo
University of Toulouse, IRIT, 31000 Toulouse, France
Thouria Ben-Haddi
University of Toulouse, IRIT, 31000 Toulouse, France
Jules Di Scala
University of Toulouse, IRIT, 31000 Toulouse, France
José G. Moreno
University of Toulouse, IRIT, 31000 Toulouse, France
Lynda Tamine
Professor in computer science, University of Toulouse, IRIT, France
Information retrieval