🤖 AI Summary
Ultrasound imaging suffers from high noise levels, scarce annotated data, and poor cross-domain generalization, key bottlenecks limiting deep learning performance. To address these challenges, we propose the first large-scale self-supervised Masked Autoencoder (MAE) framework pretrained exclusively on ultrasound data. Leveraging our newly curated open-source dataset OpenUS-46, comprising 370K 2D/3D images from 46 diverse sources, we perform self-supervised pretraining via masked patch reconstruction with a Vision Transformer (ViT) encoder-decoder architecture, enabling the model to learn anatomy-agnostic representations. Downstream fine-tuning achieves F1-scores of 81.6%, 79.6%, and 82.4% on breast cancer, ovarian tumor, and gastrointestinal stromal tumor classification, respectively, substantially outperforming CNN and standard ViT baselines and approaching or exceeding the supervised foundation model UltraSam. This work establishes, for the first time, the efficacy and generalizability of ultrasound-specific self-supervised pretraining.
📝 Abstract
Ultrasound imaging is one of the most widely used diagnostic modalities, offering real-time, radiation-free assessment across diverse clinical domains. However, interpretation of ultrasound images remains challenging due to high noise levels, operator dependence, and a limited field of view, resulting in substantial inter-observer variability. Current deep learning approaches are hindered by the scarcity of large labeled datasets and the domain gap between general and sonographic images, which limits the transferability of models pretrained on non-medical data. To address these challenges, we introduce the Ultrasound Self-Supervised Foundation Model with Masked Autoencoding (USF-MAE), the first large-scale self-supervised MAE framework pretrained exclusively on ultrasound data. The model was pretrained on 370,000 2D and 3D ultrasound images curated from 46 open-source datasets, collectively termed OpenUS-46, spanning over twenty anatomical regions. This curated dataset has been made publicly available to facilitate further research and reproducibility. Using a Vision Transformer encoder-decoder architecture, USF-MAE reconstructs masked image patches, enabling it to learn rich, modality-specific representations directly from unlabeled data. The pretrained encoder was fine-tuned on three public downstream classification benchmarks: BUS-BRA (breast cancer), MMOTU-2D (ovarian tumors), and GIST514-DB (gastrointestinal stromal tumors). Across all tasks, USF-MAE consistently outperformed conventional CNN and ViT baselines, achieving F1-scores of 81.6%, 79.6%, and 82.4%, respectively. Despite not using labels during pretraining, USF-MAE approached the performance of the supervised foundation model UltraSam on breast cancer classification and surpassed it on the other tasks, demonstrating strong cross-anatomical generalization.
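To make the masked-patch-reconstruction objective concrete, below is a minimal NumPy sketch of the core MAE bookkeeping: patchifying an image, randomly masking a fraction of patches, and computing the reconstruction loss only on the masked patches. The patch size (16), image size (224), and 75% mask ratio are the standard MAE defaults from He et al., not values stated in this abstract, and the decoder is stood in by random predictions; the paper's actual USF-MAE configuration may differ.

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W) image into non-overlapping p x p patches,
    returned as a (num_patches, p*p) array."""
    H, W = img.shape
    patches = img.reshape(H // p, p, W // p, p).swapaxes(1, 2)
    return patches.reshape(-1, p * p)

def random_mask(num_patches, mask_ratio, rng):
    """Boolean mask over patches: True = hidden from the encoder."""
    n_mask = int(num_patches * mask_ratio)
    idx = rng.permutation(num_patches)
    mask = np.zeros(num_patches, dtype=bool)
    mask[idx[:n_mask]] = True
    return mask

def mae_loss(pred, target, mask):
    """Mean squared error computed only over the masked patches,
    as in the MAE objective (visible patches are not penalized)."""
    per_patch_err = ((pred - target) ** 2).mean(axis=1)
    return per_patch_err[mask].mean()

# Toy example with a random "image" in place of an ultrasound frame.
rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224))
patches = patchify(img, 16)                 # (196, 256)
mask = random_mask(len(patches), 0.75, rng) # 147 of 196 patches masked
pred = rng.standard_normal(patches.shape)   # stand-in for decoder output
loss = mae_loss(pred, patches, mask)
```

In a full implementation, only the visible (unmasked) patch embeddings are fed to the ViT encoder, and a lightweight decoder reconstructs the masked ones; after pretraining, the decoder is discarded and the encoder is fine-tuned for classification.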