CEHR-GPT: A Scalable Multi-Task Foundation Model for Electronic Health Records

📅 2025-09-03
🤖 AI Summary
Most existing AI models for electronic health records (EHRs) are built for a single task, which limits their generalizability and clinical utility. To address this, the authors propose CEHR-GPT, a general-purpose EHR foundation model built on a time-token-based learning framework that explicitly models longitudinal patient trajectories. A single architecture supports three core capabilities: clinical feature representation, zero-shot prediction, and synthetic EHR generation. Methodologically, the model combines a time-aware Transformer, a scalable clinical vocabulary mechanism, multi-task pretraining, and zero-shot inference; generalization to external datasets relies on vocabulary expansion and fine-tuning. The model is reported to perform strongly on all three tasks, including risk prediction, patient representation learning, and synthetic data generation, and to support clinical decision support and cohort discovery without task-specific retraining.

📝 Abstract
Electronic Health Records (EHRs) provide a rich, longitudinal view of patient health and hold significant potential for advancing clinical decision support, risk prediction, and data-driven healthcare research. However, most artificial intelligence (AI) models for EHRs are designed for narrow, single-purpose tasks, limiting their generalizability and utility in real-world settings. Here, we present CEHR-GPT, a general-purpose foundation model for EHR data that unifies three essential capabilities - feature representation, zero-shot prediction, and synthetic data generation - within a single architecture. To support temporal reasoning over clinical sequences, CEHR-GPT incorporates a novel time-token-based learning framework that explicitly encodes patients' dynamic timelines into the model structure. CEHR-GPT demonstrates strong performance across all three tasks and generalizes effectively to external datasets through vocabulary expansion and fine-tuning. Its versatility enables rapid model development, cohort discovery, and patient outcome forecasting without the need for task-specific retraining.
Problem

Research questions and friction points this paper is trying to address.

Develops a scalable multi-task foundation model for EHRs
Unifies feature representation, prediction, and synthetic data generation
Encodes patients' dynamic timelines for temporal clinical reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified foundation model for EHR data
Time-token-based learning framework for timelines
Generalizes through vocabulary expansion and fine-tuning
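
The time-token idea listed above can be illustrated with a minimal sketch. The abstract does not specify the token scheme, so everything here is an assumption made for illustration: the bucket boundaries in `gap_token`, the `[VS]`/`[VE]` visit markers, and the concept code strings are hypothetical and are not taken from the paper. The sketch only shows the general technique of inserting tokens that encode the elapsed time between visits into a patient's event sequence.

```python
from datetime import date


def gap_token(days: int) -> str:
    """Map a gap in days to a coarse time token (hypothetical buckets)."""
    if days < 7:
        return f"D{days}"        # day-level resolution for short gaps
    if days < 30:
        return f"W{days // 7}"   # week-level resolution
    if days < 360:
        return f"M{days // 30}"  # approximate month-level resolution
    return "LT"                  # a single "long term" token


def tokenize_timeline(visits):
    """Flatten (visit_date, [codes]) pairs into one token sequence.

    A time token is inserted between consecutive visits so that the
    elapsed time becomes part of the sequence itself.
    """
    tokens = []
    prev = None
    for visit_date, codes in sorted(visits, key=lambda v: v[0]):
        if prev is not None:
            tokens.append(gap_token((visit_date - prev).days))
        tokens.append("[VS]")    # hypothetical visit-start marker
        tokens.extend(codes)
        tokens.append("[VE]")    # hypothetical visit-end marker
        prev = visit_date
    return tokens


# Illustrative patient with three visits (made-up concept codes).
visits = [
    (date(2024, 1, 1), ["dx:hypertension"]),
    (date(2024, 1, 4), ["rx:lisinopril"]),
    (date(2024, 3, 1), ["lab:creatinine"]),
]
tokens = tokenize_timeline(visits)
print(tokens)
```

In this sketch the 3-day gap becomes `D3` and the 57-day gap becomes `M1`, so a sequence model trained on such tokens sees both the clinical events and the spacing between them.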
Chao Pang
Department of Biomedical Informatics, Columbia University Irving Medical Center; Observational Health Data Sciences and Informatics
Jiheum Park
Department of Medicine, Columbia University Irving Medical Center
Xinzhuo Jiang
Department of Biomedical Informatics, Columbia University Irving Medical Center; Observational Health Data Sciences and Informatics
Nishanth Parameshwar Pavinkurve
Department of Biomedical Informatics, Columbia University Irving Medical Center; Observational Health Data Sciences and Informatics
Krishna S. Kalluri
Department of Biomedical Informatics, Columbia University Irving Medical Center; Observational Health Data Sciences and Informatics
Shalmali Joshi
Columbia University
Artificial Intelligence, Machine Learning, Biomedical Sciences, Clinical Informatics
Noémie Elhadad
Associate Professor and Chair of Biomedical Informatics, Columbia University
machine learning for healthcare, health informatics, natural language processing, biomedical informatics, women's health
Karthik Natarajan
Department of Biomedical Informatics, Columbia University Irving Medical Center; Observational Health Data Sciences and Informatics; Medical Informatics Services, NewYork-Presbyterian Hospital