🤖 AI Summary
Existing foundation models exhibit significant limitations in echocardiographic video analysis: they fail to capture the spatiotemporal coupling between cardiac anatomy and beating rhythm. To address this gap, we propose EchoFM, the first general-purpose foundation model designed specifically for echocardiographic videos. Our method introduces a novel spatiotemporally consistent masking strategy and a periodic-driven contrastive learning framework, enabling unsupervised joint modeling of anatomy and rhythm. Built upon a video Transformer architecture, the model undergoes large-scale multi-view self-supervised pretraining on over 290,000 echocardiographic videos (roughly 20 million frames). Evaluated on four clinical downstream tasks (view classification, functional assessment, pathology detection, and segmentation), the model consistently outperforms state-of-the-art methods, including both task-specific models and general-purpose vision foundation models, and demonstrates markedly improved generalization and transferability across views, imaging modes, and low-label regimes.
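The summary describes the masking strategy only at a high level. A minimal sketch of what spatiotemporally consistent ("tube-style") masking could look like on a ViT patch grid is shown below; the function name, tensor shapes, and the choice of a single spatial mask shared by all frames are illustrative assumptions, not the authors' released implementation.

```python
import torch

def spatiotemporal_consistent_mask(num_frames: int, num_patches: int,
                                   mask_ratio: float = 0.75,
                                   device: str = "cpu") -> torch.Tensor:
    """Mask the SAME spatial patch positions in every frame, so the
    visible and masked regions stay consistent across time.
    Returns a bool tensor of shape (num_frames, num_patches); True = masked.
    NOTE: a hypothetical sketch, not EchoFM's actual code."""
    num_masked = int(mask_ratio * num_patches)
    perm = torch.randperm(num_patches, device=device)  # one spatial permutation
    mask = torch.zeros(num_patches, dtype=torch.bool, device=device)
    mask[perm[:num_masked]] = True
    # Broadcast the shared spatial mask along the temporal axis.
    return mask.unsqueeze(0).expand(num_frames, -1)

# Example: a 16-frame clip tokenized into a 14x14 = 196 patch grid per frame.
mask = spatiotemporal_consistent_mask(num_frames=16, num_patches=196)
assert mask.shape == (16, 196)
assert (mask.sum(dim=1) == mask[0].sum()).all()  # same patches masked in every frame
```

Keeping the masked positions fixed across frames forces the encoder to reconstruct a region from its temporal evolution rather than from spatially adjacent patches in the same frame, which is what makes the mask "consistent" in time.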
📝 Abstract
Foundation models have recently gained significant attention because of their generalizability and adaptability across multiple tasks and data distributions. Although medical foundation models have emerged, solutions for cardiac imaging, especially echocardiography videos, remain largely unexplored. In this paper, we introduce EchoFM, a foundation model specifically designed to represent and analyze echocardiography videos. In EchoFM, we propose a self-supervised learning framework that captures both spatial and temporal variability patterns through a spatio-temporal consistent masking strategy and periodic-driven contrastive learning. This framework effectively captures the spatio-temporal dynamics of echocardiography and learns representative video features without any labels. We pre-train our model on an extensive dataset comprising over 290,000 echocardiography videos, covering 26 scan views across different imaging modes and totaling up to 20 million frames. The pre-trained EchoFM can then be easily adapted and fine-tuned for a variety of downstream tasks, serving as a robust backbone model. Our evaluation was systematically designed around four downstream tasks that follow the clinical echocardiography examination routine. Experimental results show that EchoFM surpasses state-of-the-art methods, including specialized echocardiography methods, self-supervised pre-training models, and general-purpose pre-trained foundation models, across all downstream tasks.
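The abstract's periodic-driven contrastive objective suggests treating sub-clips drawn from the same cardiac phase (possibly in different heartbeats) as positives and all other sub-clips as negatives. Below is a minimal InfoNCE-style sketch under that assumption; `periodic_contrastive_loss` and the availability of integer `phase_labels` (e.g., from ECG gating or a periodicity estimator) are hypothetical, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def periodic_contrastive_loss(clip_embs: torch.Tensor,
                              phase_labels: torch.Tensor,
                              temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style loss: sub-clips sharing a cardiac-phase label are
    positives; all other sub-clips in the batch act as negatives.

    clip_embs:    (N, D) embeddings of sampled sub-clips
    phase_labels: (N,) integer phase index per sub-clip (assumed given;
                  how EchoFM derives phase is not specified here)
    """
    z = F.normalize(clip_embs, dim=1)
    sim = (z @ z.t()) / temperature                    # (N, N) scaled cosine sims
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos = (phase_labels.unsqueeze(0) == phase_labels.unsqueeze(1)) & ~eye
    sim = sim.masked_fill(eye, float("-inf"))          # exclude self-pairs
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(~pos, 0.0)         # keep positive terms only
    pos_counts = pos.sum(dim=1)
    valid = pos_counts > 0                             # anchors with >= 1 positive
    return -(log_prob[valid].sum(dim=1) / pos_counts[valid]).mean()
```

Averaging over multiple positives per anchor follows the common supervised-contrastive convention; it is one reasonable instantiation of a cycle-aware objective, chosen here for clarity rather than fidelity to the paper's exact loss.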