AI Summary
This work addresses the limitations of existing electronic health record (EHR) pretraining approaches, which struggle to model both the recurrence and the new onset of clinical events and often suffer from inflated evaluation metrics caused by repeated events. To overcome these challenges, we propose RAVEN, a recurrence-aware generative autoregressive pretraining framework that learns by predicting the full sequence of clinical events in the next visit. RAVEN incorporates a recurrence regularization mechanism to mitigate evaluation bias and leverages clinical event tokenization with zero-shot transfer strategies. In zero-shot incidence prediction tasks across multiple diseases, RAVEN matches the performance of fully fine-tuned Transformer models and significantly outperforms conventional next-token prediction methods. It further generalizes well to an external cohort and reveals a synergistic scaling law between model and data size under data-constrained conditions.
Abstract
While large-scale pretraining has revolutionized language modeling, its potential remains underexplored for healthcare applications built on structured electronic health records (EHRs). We present RAVEN, a novel generative pretraining strategy for sequential EHR data based on Recurrence-Aware next-Visit EveNt prediction. Leveraging a dataset of over one million unique individuals, our model learns to autoregressively generate tokenized clinical events for the next visit, conditioned on patient history. We introduce a regularization on predicting repeated events and highlight a key pitfall in EHR-based foundation model evaluations: repeated event tokens can inflate performance metrics when new onsets are not distinguished from subsequent occurrences. Furthermore, we empirically investigate scaling behavior in a data-constrained, compute-saturated regime, showing that simply increasing model size is suboptimal without a commensurate increase in data volume. We evaluate our model via zero-shot forecasting of the incidence of a diverse set of diseases, where it rivals fully fine-tuned representation-based Transformer models and outperforms widely used simulation-based next-token approaches. Finally, without additional parameter updates, we show that RAVEN generalizes to an external patient cohort despite lossy clinical code mappings and feature coverage gaps.
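To make the evaluation pitfall concrete, the toy sketch below shows one way to score next-visit event prediction while excluding events already present in a patient's history, so that recurrences cannot inflate the metric. All names, the multi-hot target construction, and the masking scheme are illustrative assumptions, not the authors' actual RAVEN implementation.

```python
import torch

VOCAB = 10  # toy clinical-event vocabulary size (illustrative)

def next_visit_targets(next_visit, vocab_size=VOCAB):
    """Multi-hot target vector over all events in the next visit."""
    y = torch.zeros(vocab_size)
    y[next_visit] = 1.0
    return y

def new_onset_mask(history, vocab_size=VOCAB):
    """1.0 for events never seen in the history (new onsets), 0.0 for recurrences."""
    m = torch.ones(vocab_size)
    for visit in history:
        m[visit] = 0.0
    return m

# Toy patient: two past visits, then the visit to be predicted.
history = [[1, 3], [3, 5]]
next_visit = [3, 7]  # event 3 recurs; event 7 is a new onset

y = next_visit_targets(next_visit)
mask = new_onset_mask(history)

logits = torch.zeros(VOCAB)  # stand-in for model output
bce = torch.nn.functional.binary_cross_entropy_with_logits(
    logits, y, reduction="none")

# Recurrence-aware scoring: average loss only over new-onset-eligible
# positions, so repeated events cannot inflate the result.
masked_loss = (bce * mask).sum() / mask.sum()
```

With uniform logits every position contributes the same per-event loss, so the masked and unmasked averages coincide; for a model that has memorized recurrences, the masked score is the stricter one.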