🤖 AI Summary
Traditional cardiac signal analysis models suffer from poor generalizability, reliance on homogeneous data, and static architectures. To address these limitations, we propose the first large-scale multimodal foundation model for cardiac sensing. Built upon a Transformer architecture, it employs generative masked pretraining to jointly model heterogeneous ECG/PPG signals and clinical text reports from approximately 1.7 million individuals. This enables unified representation learning across devices, leads, and tasks. Compared to unimodal, single-task approaches, our model consistently performs better across diverse downstream tasks, including diagnostic classification, vital sign estimation, and prognostic prediction, and it remains robust when the lead configuration or the available modality changes. It establishes a transferable foundation model paradigm for intelligent cardiac health monitoring in real-world, clinically heterogeneous environments.
📝 Abstract
Cardiac biosignals, such as electrocardiograms (ECG) and photoplethysmograms (PPG), are of paramount importance for the diagnosis, prevention, and management of cardiovascular diseases, and have been extensively used in a variety of clinical tasks. Conventional deep learning approaches for analyzing these signals typically rely on homogeneous datasets and static bespoke models, limiting their robustness and generalizability across diverse clinical settings and acquisition protocols. In this study, we present a cardiac sensing foundation model (CSFM) that leverages advanced transformer architectures and a generative, masked pretraining strategy to learn unified representations from vast, heterogeneous health records. Our model is pretrained on an innovative multi-modal integration of data from multiple large-scale datasets (including MIMIC-III-WDB, MIMIC-IV-ECG, and CODE), comprising cardiac signals and the corresponding clinical or machine-generated text reports from approximately 1.7 million individuals. We demonstrate that the embeddings derived from our CSFM not only serve as effective feature extractors across diverse cardiac sensing scenarios, but also enable seamless transfer learning across varying input configurations and sensor modalities. Extensive evaluations across diagnostic tasks, demographic information recognition, vital sign measurement, clinical outcome prediction, and ECG question answering reveal that CSFM consistently outperforms traditional one-modal-one-task approaches. Notably, CSFM exhibits robust performance across multiple ECG lead configurations, from standard 12-lead systems to single-lead setups, and in scenarios where only ECG, only PPG, or a combination thereof is available. These findings highlight the potential of CSFM as a versatile and scalable solution for comprehensive cardiac monitoring.
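To make the idea of generative masked pretraining on cardiac signals concrete, the following is a minimal sketch, not the CSFM implementation. It assumes a patch-based scheme in which a multi-lead signal is cut into fixed-length patches, a random subset is zeroed out, and a reconstruction loss is computed only on the hidden patches. The function names `mask_patches` and `reconstruction_loss`, the patch length, the mask ratio, the zero-fill, and the mean-squared-error objective are all illustrative assumptions; the abstract does not specify any of them.

```python
import numpy as np

def mask_patches(signal, patch_len=50, mask_ratio=0.5, seed=0):
    """Split a (leads, samples) signal into patches and zero out a random subset.

    Returns the masked signal and a boolean mask of shape (leads, n_patches),
    where True marks a patch a model would have to reconstruct.
    All parameter defaults are hypothetical, chosen only for illustration.
    """
    rng = np.random.default_rng(seed)
    leads, n_samples = signal.shape
    n_patches = n_samples // patch_len
    # Drop trailing samples that do not fill a whole patch.
    patches = signal[:, : n_patches * patch_len].reshape(leads, n_patches, patch_len)
    n_masked = int(round(mask_ratio * n_patches))
    mask = np.zeros((leads, n_patches), dtype=bool)
    for lead in range(leads):
        # Choose the hidden patches independently for each lead.
        idx = rng.choice(n_patches, size=n_masked, replace=False)
        mask[lead, idx] = True
    masked = np.where(mask[:, :, None], 0.0, patches)
    return masked.reshape(leads, n_patches * patch_len), mask

def reconstruction_loss(pred, target, mask, patch_len=50):
    """Mean squared error computed only over the masked patches."""
    leads, n_patches = mask.shape
    sq_err = ((pred - target) ** 2).reshape(leads, n_patches, patch_len)
    return sq_err[mask].mean()

# Example: a synthetic 12-lead recording of 500 samples.
ecg = np.ones((12, 500))
masked_ecg, mask = mask_patches(ecg, patch_len=50, mask_ratio=0.5)
# A model that simply echoes its masked input gets every hidden sample wrong.
loss = reconstruction_loss(masked_ecg, ecg, mask, patch_len=50)
```

Because the masking operates per lead, the same routine applies unchanged to a single-lead input such as `ecg[:1]` or to a PPG channel, which is one plausible way a single pretrained model could accept varying input configurations.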