🤖 AI Summary
This work addresses the challenge of integrating sparse, irregularly documented electronic health records (EHR) with dense but semantically shallow wearable sensor data to build faithful longitudinal health representations. The authors propose a multimodal foundation model that couples modality-specific encoders with a shared temporal backbone, jointly modeling EHR and wearable streams as a continuous-time latent process. The model is pretrained with self-supervised and cross-modal objectives, enabling early, deep fusion of the two data types rather than late fusion of separately trained models. The resulting representations are temporally coherent and clinically grounded, and they outperform strong unimodal baselines on physiological forecasting and risk prediction tasks, with the largest gains at long prediction horizons and under heavy missingness.
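The summary describes the architecture only at a high level, so the sketch below is a minimal PyTorch illustration, not the paper's implementation: all class names and dimensions are hypothetical, and feeding inter-event time gaps to a shared recurrent backbone is just one common way to approximate a continuous-time latent process over irregular events.

```python
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """Projects one modality's observations into a shared latent space.

    Hypothetical component: the paper specifies modality-specific
    encoders but not their exact architecture.
    """
    def __init__(self, input_dim: int, latent_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(input_dim, latent_dim),
            nn.GELU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)


class SharedTemporalBackbone(nn.Module):
    """Shared backbone over time-stamped tokens from both modalities.

    Continuous time is approximated by concatenating the inter-event
    gap delta_t to each token embedding; the actual model may handle
    time differently.
    """
    def __init__(self, latent_dim: int):
        super().__init__()
        self.rnn = nn.GRU(latent_dim + 1, latent_dim, batch_first=True)

    def forward(self, tokens: torch.Tensor, delta_t: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq, latent_dim); delta_t: (batch, seq, 1)
        out, _ = self.rnn(torch.cat([tokens, delta_t], dim=-1))
        return out  # (batch, seq, latent_dim): one latent state per event


# Toy usage: 8 patients, 16 EHR events (32-dim codes) and 16 wearable
# windows (6-dim signals). The two streams are concatenated here for
# brevity; a real pipeline would interleave them by timestamp.
ehr_enc, wear_enc = ModalityEncoder(32, 64), ModalityEncoder(6, 64)
backbone = SharedTemporalBackbone(64)
ehr = ehr_enc(torch.randn(8, 16, 32))
wear = wear_enc(torch.randn(8, 16, 6))
tokens = torch.cat([ehr, wear], dim=1)   # (8, 32, 64)
delta_t = torch.rand(8, 32, 1)           # hours since previous event
states = backbone(tokens, delta_t)       # (8, 32, 64)
```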
📝 Abstract
Foundation models trained on electronic health records show strong performance on many clinical prediction tasks but are limited by sparse and irregular documentation. Wearable devices provide dense, continuous physiological signals but lack semantic grounding. Existing methods usually model these data sources separately or combine them through late fusion. We propose a multimodal foundation model that jointly represents electronic health record and wearable data as a continuous-time latent process. The model uses modality-specific encoders and a shared temporal backbone pretrained with self-supervised and cross-modal objectives. This design produces representations that are temporally coherent and clinically grounded. Across physiological forecasting and risk modeling tasks, the model outperforms strong baselines trained only on electronic health records or only on wearable data, especially at long horizons and under missing data. These results show that joint electronic health record and wearable pretraining yields more faithful representations of longitudinal health.
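The abstract names self-supervised and cross-modal pretraining objectives without specifying them. One common instantiation of a cross-modal objective, shown below purely as an assumption, is a symmetric InfoNCE loss that treats pooled EHR and wearable embeddings from the same patient window as positive pairs and all other in-batch pairings as negatives.

```python
import torch
import torch.nn.functional as F

def cross_modal_infonce(ehr_z: torch.Tensor, wear_z: torch.Tensor,
                        temperature: float = 0.1) -> torch.Tensor:
    """Symmetric InfoNCE loss aligning pooled EHR and wearable embeddings.

    Assumed instantiation of the paper's unspecified cross-modal
    objective: same-window embeddings are positives, everything else
    in the batch is a negative.
    """
    ehr_z = F.normalize(ehr_z, dim=-1)
    wear_z = F.normalize(wear_z, dim=-1)
    logits = ehr_z @ wear_z.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(ehr_z.size(0), device=ehr_z.device)
    # Average the EHR->wearable and wearable->EHR retrieval losses.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Toy usage with random pooled embeddings for a batch of 8 windows.
loss = cross_modal_infonce(torch.randn(8, 64), torch.randn(8, 64))
```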