🤖 AI Summary
This work addresses the heavy reliance on extensive pixel-level annotations and the limited cross-domain generalization of ultrasound adnexal mass segmentation by introducing, for the first time, the self-supervised foundation model DINOv3 into medical image segmentation. By pairing DINOv3 with a Dense Prediction Transformer (DPT) decoder and a multi-scale feature fusion strategy, the proposed method jointly models global semantics and local details while substantially reducing annotation requirements. Evaluated on 7,777 clinical ultrasound frames, the approach achieves a Dice coefficient of 0.945 and reduces the 95% Hausdorff distance by 11.4% compared with the strongest convolutional baseline. Notably, it maintains superior performance even with only 25% of the annotated data, demonstrating improved boundary delineation and strong generalization under limited supervision.
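The summary's "multi-scale feature fusion" refers to DPT-style reassembly of ViT patch tokens into spatial feature maps that are then combined. A minimal numpy sketch of that idea (a schematic stand-in, not the paper's implementation; `grid`, `out_factor`, and the summation fusion are illustrative assumptions):

```python
import numpy as np

def tokens_to_map(tokens, grid):
    """(grid*grid, C) patch tokens -> (C, grid, grid) feature map."""
    c = tokens.shape[1]
    return tokens.reshape(grid, grid, c).transpose(2, 0, 1)

def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling along both spatial axes."""
    return fmap.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse_layers(layer_tokens, grid, out_factor=4):
    """Reassemble tokens from several transformer depths into spatial
    maps, bring them to a common finer resolution, and fuse by sum --
    a toy version of DPT's reassemble + fusion blocks (assumed here)."""
    maps = [upsample_nearest(tokens_to_map(t, grid), out_factor)
            for t in layer_tokens]
    return np.sum(maps, axis=0)

# toy example: 3 layers of a ViT with a 4x4 patch grid, 8 channels
rng = np.random.default_rng(0)
layers = [rng.standard_normal((16, 8)) for _ in range(3)]
fused = fuse_layers(layers, grid=4, out_factor=4)
print(fused.shape)  # -> (8, 16, 16)
```

In the real decoder, learned convolutions and progressive coarse-to-fine fusion replace the nearest-neighbour upsampling and plain summation used here.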
📝 Abstract
Adnexal mass evaluation via ultrasound is a challenging clinical task, often hindered by subjective interpretation and significant inter-observer variability. While automated segmentation is a foundational step for quantitative risk assessment, traditional fully supervised convolutional architectures frequently require large amounts of pixel-level annotations and struggle with the domain shifts common in medical imaging. In this work, we propose a label-efficient segmentation framework that leverages the robust semantic priors of a pretrained DINOv3 foundation vision transformer backbone. By integrating this backbone with a Dense Prediction Transformer (DPT)-style decoder, our model hierarchically reassembles multi-scale features to combine global semantic representations with fine-grained spatial details. Evaluated on a clinical dataset of 7,777 annotated frames from 112 patients, our method achieves state-of-the-art performance compared to established fully supervised baselines, including U-Net, U-Net++, DeepLabV3, and MAnet. Specifically, we obtain a Dice score of 0.945 and improved boundary adherence, reducing the 95th-percentile Hausdorff distance by 11.4% relative to the strongest convolutional baseline. Furthermore, we conduct an extensive efficiency analysis demonstrating that our DINOv3-based approach retains significantly higher performance in data-starved regimes, maintaining strong results even when trained on only 25% of the training data. These results suggest that large-scale self-supervised foundation models provide a promising and data-efficient solution for medical image segmentation in data-constrained clinical environments. Project Repository: https://github.com/FrancescaFati/MESA
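For readers less familiar with the two reported metrics, here is a minimal numpy sketch of the Dice score and the 95th-percentile Hausdorff distance on binary masks (a brute-force illustration; the toy masks and function names are our own, and production pipelines typically use distance transforms or libraries such as MONAI instead):

```python
import numpy as np

def dice(pred, gt):
    """Dice coefficient between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    return 2.0 * inter / denom if denom else 1.0

def hd95(pred, gt):
    """95th-percentile symmetric Hausdorff distance in pixels.
    Brute-force pairwise distances: fine for small masks only."""
    a = np.argwhere(pred.astype(bool))
    b = np.argwhere(gt.astype(bool))
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    fwd = d.min(axis=1)   # each predicted point -> nearest gt point
    bwd = d.min(axis=0)   # each gt point -> nearest predicted point
    return np.percentile(np.concatenate([fwd, bwd]), 95)

# toy case: ground-truth 8x8 square vs. prediction shifted down one row
gt = np.zeros((16, 16), bool); gt[4:12, 4:12] = True
pred = np.zeros((16, 16), bool); pred[5:13, 4:12] = True
print(dice(pred, gt))  # -> 0.875
print(hd95(pred, gt))  # -> 1.0
```

Unlike Dice, which rewards bulk overlap, the 95th-percentile Hausdorff distance penalizes outlying boundary errors while discarding the worst 5%, which is why the paper reports both.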