🤖 AI Summary
Echocardiography foundation models lack a standardized evaluation benchmark. Acquisitions are noisy, frames are highly redundant, and public datasets are scarce, so most current studies evaluate on private data, which impairs comparability and reproducibility. To address this, we introduce CardioBench, the first open benchmark designed specifically for evaluating echocardiography foundation models. It integrates eight public datasets into a suite of four regression and five classification tasks, and supports three evaluation protocols under consistent settings: zero-shot transfer, linear probing, and representation alignment. Comparing cardiac-specific, biomedical, and general-purpose encoders, our analysis reveals complementary strengths across model families: temporal modeling is critical for functional regression, retrieval provides robustness under distribution shift, and domain-specific text encoders capture physiologically meaningful axes. General-purpose encoders transfer strongly and often close the gap with probing, yet fine-grained view classification and subtle pathology recognition remain challenging. CardioBench's preprocessing, data splits, and evaluation pipelines are publicly released.
📝 Abstract
Foundation models (FMs) are reshaping medical imaging, yet their application in echocardiography remains limited. While several echocardiography-specific FMs have recently been introduced, no standardized benchmark exists to evaluate them. Echocardiography poses unique challenges, including noisy acquisitions, high frame redundancy, and limited public datasets. Most existing studies evaluate on private data, restricting comparability. To address this, we introduce CardioBench, a comprehensive benchmark for echocardiography FMs. CardioBench unifies eight publicly available datasets into a standardized suite spanning four regression and five classification tasks, covering functional, structural, diagnostic, and view recognition endpoints. We evaluate several leading FMs, including cardiac-specific, biomedical, and general-purpose encoders, under consistent zero-shot, probing, and alignment protocols. Our results highlight complementary strengths across model families: temporal modeling is critical for functional regression, retrieval provides robustness under distribution shift, and domain-specific text encoders capture physiologically meaningful axes. General-purpose encoders transfer strongly and often close the gap with probing, but struggle with fine-grained distinctions like view classification and subtle pathology recognition. By releasing preprocessing, splits, and public evaluation pipelines, CardioBench establishes a reproducible reference point and offers actionable insights to guide the design of future echocardiography foundation models.