🤖 AI Summary
Most existing AI models for electronic health records (EHRs) are single-task architectures, which limits their generalizability and clinical utility. To address this, the authors propose CEHR-GPT, a general-purpose foundation model for EHR data built on a time-token-based learning framework that explicitly encodes patients' longitudinal timelines. A single architecture jointly supports three core capabilities: clinical feature representation, zero-shot prediction, and synthetic EHR generation. Zero-shot prediction requires no task-specific retraining, while generalization to external datasets relies on vocabulary expansion and fine-tuning. The model shows strong performance on all three tasks, including on external datasets, which supports faster model development, cohort discovery, and patient outcome forecasting.
📝 Abstract
Electronic Health Records (EHRs) provide a rich, longitudinal view of patient health and hold significant potential for advancing clinical decision support, risk prediction, and data-driven healthcare research. However, most artificial intelligence (AI) models for EHRs are designed for narrow, single-purpose tasks, limiting their generalizability and utility in real-world settings. Here, we present CEHR-GPT, a general-purpose foundation model for EHR data that unifies three essential capabilities (feature representation, zero-shot prediction, and synthetic data generation) within a single architecture. To support temporal reasoning over clinical sequences, CEHR-GPT incorporates a novel time-token-based learning framework that explicitly encodes patients' dynamic timelines into the model structure. CEHR-GPT demonstrates strong performance across all three tasks and generalizes effectively to external datasets through vocabulary expansion and fine-tuning. Its versatility enables rapid model development, cohort discovery, and patient outcome forecasting without the need for task-specific retraining.
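To make the idea of time tokens concrete, the following is a minimal sketch of how gaps between visits might be turned into tokens inside a patient sequence. It is an illustration only, not the authors' implementation: the bucket boundaries, the token names (`D3`, `W2`, `M1`, `LT`), the visit delimiters `[VS]` and `[VE]`, and the concept codes `C1` to `C4` are all assumptions chosen for this example.

```python
from datetime import date


def time_token(gap_days: int) -> str:
    """Map a gap between two visits to a coarse time token (hypothetical buckets)."""
    if gap_days < 0:
        raise ValueError("visits must be in chronological order")
    if gap_days < 7:
        return f"D{gap_days}"        # days: D0..D6
    if gap_days < 30:
        return f"W{gap_days // 7}"   # weeks: W1..W4
    if gap_days < 365:
        return f"M{gap_days // 30}"  # months: M1..M12
    return "LT"                      # long-term gap of a year or more


def tokenize_patient(visits):
    """Flatten (visit_date, [codes]) pairs into one token sequence.

    A time token is inserted between consecutive visits so that the
    elapsed time becomes part of the sequence the model reads.
    """
    tokens = []
    previous_date = None
    for visit_date, codes in sorted(visits, key=lambda v: v[0]):
        if previous_date is not None:
            tokens.append(time_token((visit_date - previous_date).days))
        tokens.append("[VS]")   # hypothetical visit-start marker
        tokens.extend(codes)
        tokens.append("[VE]")   # hypothetical visit-end marker
        previous_date = visit_date
    return tokens


# Placeholder concept codes; real data would use clinical vocabulary IDs.
example_visits = [
    (date(2020, 1, 1), ["C1", "C2"]),
    (date(2020, 1, 4), ["C3"]),
    (date(2020, 3, 1), ["C4"]),
]
example_tokens = tokenize_patient(example_visits)
```

Because elapsed time appears as ordinary tokens, a generative model trained on such sequences can in principle predict when the next event occurs as well as what it is, which is one way a single sequence model could serve both forecasting and synthetic data generation.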