🤖 AI Summary
To address data scarcity, privacy constraints, and insufficient early-diagnostic accuracy in breast ultrasound image analysis, this paper introduces BUSGen, the first foundational generative model tailored to this domain. Trained on over 3.5 million de-identified ultrasound images via large-scale self-supervised pretraining and conditional diffusion modeling, BUSGen integrates anatomical-pathological joint representation learning and few-shot prompt-based fine-tuning to generate high-fidelity, task-specific synthetic data. It enables privacy-preserving data sharing, and its generated data proves as effective as collected real-world data for training diagnostic models. Experiments show that BUSGen improves early-diagnostic sensitivity by an average of 16.5% relative to nine board-certified radiologists (p < 0.0001) and enhances the generalizability of downstream models. The model and a public demo platform are open-sourced.
📝 Abstract
Foundational models have emerged as powerful tools for addressing various tasks in clinical settings. However, their potential for breast ultrasound analysis remains untapped. In this paper, we present BUSGen, the first foundational generative model specifically designed for breast ultrasound image analysis. Pretrained on over 3.5 million breast ultrasound images, BUSGen has acquired extensive knowledge of breast structures, pathological features, and clinical variations. With few-shot adaptation, BUSGen can generate repositories of realistic and informative task-specific data, facilitating the development of models for a wide range of downstream tasks. Extensive experiments highlight BUSGen's exceptional adaptability: it significantly outperforms foundational models trained on real data in breast cancer screening, diagnosis, and prognosis. In early breast cancer diagnosis, our approach outperformed all nine board-certified radiologists, achieving an average sensitivity improvement of 16.5% (P < 0.0001). Additionally, we characterized the scaling effect of generated data and found it to be as effective as collected real-world data for training diagnostic models. Further experiments demonstrated that our approach improves the generalization ability of downstream models. Importantly, BUSGen protects patient privacy by enabling fully de-identified data sharing, a step forward in secure medical data utilization. An online demo of BUSGen is available at https://aibus.bio.