Benchmarking Ultrasound Foundation Models for Fetal Plane Classification

📅 2026-05-26
🤖 AI Summary
This study addresses the generalization challenges in ultrasound fetal plane classification that arise from operator dependency, image noise, and scarce annotations. It systematically evaluates the transferability of four ultrasound-specific foundation models (USFM, MOFO, UltraSAM, FetalCLIP) alongside general-purpose vision models pretrained on natural images (ResNet50, EfficientNet-V2, DINOv3). Each model is trained with both full fine-tuning and linear probing, using five-fold patient-level cross-validation on a Spanish fetal ultrasound dataset, and is tested on in-domain data and on an external African cohort. The benchmark shows that the pretraining objective strongly influences downstream performance: FetalCLIP achieves the best results under linear probing (in-domain F1=0.9261, out-of-domain F1=0.9731), whereas USFM performs best with full fine-tuning (in-domain F1=0.9476, out-of-domain F1=0.9515).
📝 Abstract
Ultrasound is widely used in obstetric care due to its safety, accessibility, and real-time imaging. However, interpretation remains operator-dependent and susceptible to noise and artifacts. Deep learning models have shown strong performance in addressing these problems, but they typically require large annotated datasets that are difficult to obtain in clinical ultrasound. Foundation models (FMs) offer an alternative: they are pretrained on a large number of ultrasound images to learn transferable representations that can generalize with limited labeled data. This work presents a comprehensive benchmark of ultrasound-specific FMs for fetal plane classification. We evaluated four ultrasound FMs (USFM, MOFO, UltraSAM, FetalCLIP) against two CNN baselines (ResNet50, EfficientNet-V2) and a ViT (DINOv3) pretrained on natural images. We trained all models under two complementary settings: full fine-tuning and linear probing with a frozen encoder. All models were trained using 5-fold patient-level cross-validation on a Spanish fetal ultrasound dataset and tested on both in-domain data and an external African cohort to assess cross-population generalization. We found that FetalCLIP achieved the best results in the linear probing setting (F1 = 0.9261 for in-domain, F1 = 0.9731 for out-of-domain), while USFM performed best in the full fine-tuning setting (F1 = 0.9476 for in-domain, F1 = 0.9515 for out-of-domain). MOFO and UltraSAM showed the largest performance degradation in both settings, underperforming natural-image-pretrained models in some cases. These findings highlight how the choice of pretrained model strongly affects fetal plane classification performance, since different pretraining objectives lead to different levels of transferability.
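The abstract's "5-fold patient-level cross-validation" means that folds are formed by patient rather than by image, so no patient's images appear in both the training and validation portions of a split. The following is a minimal sketch of that idea using only the Python standard library. It is not the authors' code: the function names, the round-robin assignment, the seed, and the toy patient IDs are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def patient_level_folds(patient_ids, n_folds=5, seed=0):
    """Assign sample indices to folds so that no patient spans two folds.

    patient_ids[i] is the (hypothetical) patient identifier of sample i.
    """
    # Group sample indices by patient.
    by_patient = defaultdict(list)
    for idx, pid in enumerate(patient_ids):
        by_patient[pid].append(idx)

    # Shuffle patients, not images, with a fixed seed for reproducibility.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)

    # Deal patients round-robin into folds; all of a patient's images
    # go to the same fold.
    folds = [[] for _ in range(n_folds)]
    for i, pid in enumerate(patients):
        folds[i % n_folds].extend(by_patient[pid])
    return folds


def train_val_splits(folds):
    """Yield (train_indices, val_indices), holding out one fold at a time."""
    for k, val in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, val


# Toy data (assumed): 10 patients with 3 images each.
patient_ids = [f"p{n:02d}" for n in range(10) for _ in range(3)]
folds = patient_level_folds(patient_ids)

for train, val in train_val_splits(folds):
    train_patients = {patient_ids[i] for i in train}
    val_patients = {patient_ids[i] for i in val}
    # Patient-level separation: the two sets never share a patient.
    assert train_patients.isdisjoint(val_patients)
```

Splitting by image instead would let near-duplicate frames from one examination land on both sides of a split, which tends to inflate validation scores. This sketch does not balance class frequencies across folds; whether the benchmark stratified its folds is not stated in the abstract.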
Problem

Research questions and friction points this paper is trying to address.

fetal plane classification
ultrasound foundation models
limited labeled data
cross-population generalization
operator-dependent interpretation
Innovation

Methods, ideas, or system contributions that make the work stand out.

foundation models
fetal plane classification
ultrasound benchmarking
cross-population generalization
linear probing