CEHR-GPT: A Scalable Multi-Task Foundation Model for Electronic Health Records

📅 2025-09-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Most existing AI models for electronic health records (EHRs) are single-task architectures, which limits their generalizability and clinical utility. To address this, the authors propose CEHR-GPT, a general-purpose foundation model for EHRs built on a time-token-based learning framework that explicitly models patients' longitudinal trajectories. Its unified architecture jointly supports three core capabilities: feature representation, zero-shot prediction, and synthetic EHR generation. Methodologically, it combines a time-aware Transformer, an expandable clinical vocabulary, multi-task pretraining, and zero-shot inference. Generalization to external datasets is achieved through vocabulary expansion and fine-tuning. The model is reported to perform strongly across all three tasks, including on external EHR datasets, which supports rapid model development, clinical decision support, and cohort discovery.

📝 Abstract

Electronic Health Records (EHRs) provide a rich, longitudinal view of patient health and hold significant potential for advancing clinical decision support, risk prediction, and data-driven healthcare research. However, most artificial intelligence (AI) models for EHRs are designed for narrow, single-purpose tasks, limiting their generalizability and utility in real-world settings. Here, we present CEHR-GPT, a general-purpose foundation model for EHR data that unifies three essential capabilities - feature representation, zero-shot prediction, and synthetic data generation - within a single architecture. To support temporal reasoning over clinical sequences, CEHR-GPT incorporates a novel time-token-based learning framework that explicitly encodes patients' dynamic timelines into the model structure. CEHR-GPT demonstrates strong performance across all three tasks and generalizes effectively to external datasets through vocabulary expansion and fine-tuning. Its versatility enables rapid model development, cohort discovery, and patient outcome forecasting without the need for task-specific retraining.
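To make the idea of a time-token-based encoding concrete, here is a minimal Python sketch of how gaps between visits might be turned into tokens and interleaved with clinical codes. It is an illustration only, not the authors' implementation: the bucket boundaries, the token names (`D3`, `W2`, `M1`, `LT`), the visit delimiters `[VS]` and `[VE]`, and the function names are all assumptions introduced here.

```python
from datetime import date


def time_token(days: int) -> str:
    """Map a gap in days to a coarse time token (hypothetical buckets)."""
    if days < 0:
        raise ValueError("visits must be in chronological order")
    if days < 7:
        return f"D{days}"        # day-level resolution for short gaps
    if days < 30:
        return f"W{days // 7}"   # week-level resolution
    if days < 365:
        return f"M{days // 30}"  # approximate month-level resolution
    return "LT"                  # single "long term" bucket for a year or more


def tokenize_timeline(visits):
    """Flatten (date, codes) visits, sorted by date, into one token sequence.

    A time token is inserted between consecutive visits so that the
    elapsed time is part of the sequence the model reads.
    """
    tokens = []
    previous_date = None
    for visit_date, codes in visits:
        if previous_date is not None:
            tokens.append(time_token((visit_date - previous_date).days))
        tokens.append("[VS]")    # assumed visit-start marker
        tokens.extend(codes)
        tokens.append("[VE]")    # assumed visit-end marker
        previous_date = visit_date
    return tokens


# Toy patient with three visits; the codes are placeholders.
visits = [
    (date(2020, 1, 1), ["E11.9"]),
    (date(2020, 1, 4), ["I10", "metformin"]),
    (date(2020, 3, 1), ["E11.9"]),
]
sequence = tokenize_timeline(visits)
print(sequence)
```

Because the gaps appear as ordinary tokens, a sequence model trained on such input can in principle both condition on and generate elapsed time, which is one plausible reading of how a single architecture could serve prediction and synthetic timeline generation.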

Problem

Research questions and friction points this paper is trying to address.

Develops a scalable multi-task foundation model for EHRs

Unifies feature representation, prediction, and synthetic data generation

Encodes patients' dynamic timelines for temporal clinical reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified foundation model for EHR data

Time-token-based learning framework for timelines

Generalizes through vocabulary expansion and fine-tuning
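As a rough illustration of what vocabulary expansion could involve before fine-tuning on an external dataset, the Python sketch below appends unseen codes to a token-to-index map and gives each new token an embedding row. The mean-of-existing-rows initialization and the name `expand_vocabulary` are assumptions made for this example; the source does not describe the authors' actual procedure.

```python
def expand_vocabulary(vocab, embeddings, new_tokens):
    """Return copies of vocab and embeddings extended with unseen tokens.

    vocab: dict mapping token -> row index
    embeddings: list of equal-length float lists, one row per token
    New rows start at the mean of the existing rows (an assumed heuristic),
    to be refined later by fine-tuning.
    """
    vocab = dict(vocab)
    embeddings = [list(row) for row in embeddings]
    dim = len(embeddings[0])
    count = len(embeddings)
    mean_row = [sum(row[i] for row in embeddings) / count for i in range(dim)]
    for token in new_tokens:
        if token not in vocab:           # keep existing tokens and rows untouched
            vocab[token] = len(vocab)
            embeddings.append(list(mean_row))
    return vocab, embeddings


# Toy example: two known tokens, two new codes from an external site.
base_vocab = {"[PAD]": 0, "E11.9": 1}
base_embeddings = [[0.0, 0.0], [2.0, 4.0]]
new_vocab, new_embeddings = expand_vocabulary(
    base_vocab, base_embeddings, ["I10", "E11.9", "N18.3"]
)
print(new_vocab)
print(new_embeddings)
```

Existing rows are left unchanged, so representations learned during pretraining carry over while only the new codes start from a generic initialization.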

🔎 Similar Papers

EMERGE: Enhancing Multimodal Electronic Health Records Predictive Modeling with Retrieval-Augmented Generation