🤖 AI Summary
The absence of standardized, text-based electronic health record (EHR) evaluation benchmarks impedes fair comparison and deployment of large language models (LLMs) in clinical downstream tasks. To address this, we introduce the first open-source, Hugging Face–native textual EHR benchmark—systematically transforming MIMIC-IV’s structured clinical data into natural-language sequences via template engineering. The benchmark supports zero-shot prompting, supervised fine-tuning (e.g., Llama-3, Phi-3), and comparison with traditional models. Our key contributions include the first standardized, open-access textual reconstruction of MIMIC-IV and its integration into the Hugging Face ecosystem. Empirical results show that fine-tuned textual models achieve an AUC of 0.86 on in-hospital mortality prediction—matching state-of-the-art tabular models (XGBoost, logistic regression)—thereby validating the viability of the textual pathway. In contrast, zero-shot LLMs underperform markedly (AUC < 0.6), underscoring the critical impact of domain-specific adaptation and data representation on LLM efficacy in healthcare.
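The headline comparison above rests on ROC AUC as the common yardstick between textual and tabular models. As a minimal sketch of that metric, the pure-Python function below computes the pairwise (Mann–Whitney) estimate of AUC; the labels and scores are toy values for illustration only, not results from the paper.

```python
def auc(labels, scores):
    """Pairwise (Mann-Whitney) estimate of the ROC AUC.

    Equals the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative one
    (ties count as half a win).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: two negatives, two positives.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # → 0.75
```

In practice a library implementation such as scikit-learn's `roc_auc_score` would be used, but the definition above is what a reported "AUC of 0.86" refers to.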
📝 Abstract
The lack of standardized evaluation benchmarks for text inputs in the medical domain is a barrier to widely adopting natural language models and leveraging their potential for health-related downstream tasks. This paper revisits an openly available MIMIC-IV benchmark for electronic health records (EHRs) to address this issue. First, we integrate the MIMIC-IV data into the Hugging Face datasets library to allow easy sharing and use of this collection. Second, we investigate the application of templates to convert EHR tabular data to text. Experiments using fine-tuned and zero-shot LLMs on the in-hospital patient mortality prediction task show that fine-tuned text-based models are competitive against robust tabular classifiers. In contrast, zero-shot LLMs struggle to leverage EHR representations. This study underlines the potential of text-based approaches in the medical field and highlights areas for further improvement.
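The template step described above can be sketched as follows. This is an illustrative reconstruction, not the paper's actual templates: the record keys (`age`, `gender`, `admission_type`, `measurements`) are hypothetical stand-ins for MIMIC-IV fields.

```python
# Hypothetical sketch of template engineering: serializing one tabular
# EHR record into a natural-language sequence an LLM can consume.

def row_to_text(row: dict) -> str:
    """Render a tabular EHR record as templated English sentences."""
    parts = [
        f"The patient is a {row['age']}-year-old {row['gender']}.",
        f"Admission type: {row['admission_type']}.",
    ]
    # Each measurement becomes one short templated sentence.
    for name, value, unit in row["measurements"]:
        parts.append(f"The {name} is {value} {unit}.")
    return " ".join(parts)

record = {
    "age": 67,
    "gender": "female",
    "admission_type": "emergency",
    "measurements": [("heart rate", 92, "bpm"), ("temperature", 37.8, "C")],
}
print(row_to_text(record))
# → The patient is a 67-year-old female. Admission type: emergency.
#   The heart rate is 92 bpm. The temperature is 37.8 C.
```

Sequences of this form can then be fed to zero-shot prompting or used as inputs for supervised fine-tuning, exactly where a tabular classifier would instead receive the raw feature vector.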