Adapting Foundation Models for Annotation-Efficient Adnexal Mass Segmentation in Cine Images

📅 2026-04-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the heavy reliance on extensive pixel-level annotations and limited cross-domain generalization in ultrasound adnexal mass segmentation by introducing, for the first time, the self-supervised foundation model DINOv3 into medical image segmentation. By integrating DINOv3 with a Dense Prediction Transformer (DPT) decoder and a multi-scale feature fusion strategy, the proposed method jointly models global semantics and local details, substantially reducing annotation requirements. Evaluated on 7,777 clinical ultrasound frames, the approach achieves a Dice coefficient of 0.945 and reduces the 95% Hausdorff distance by 11.4% compared to the strongest convolutional baseline. Notably, it maintains superior performance even with only 25% of the annotated data, significantly enhancing boundary delineation accuracy and few-shot generalization capability.
📝 Abstract
Adnexal mass evaluation via ultrasound is a challenging clinical task, often hindered by subjective interpretation and significant inter-observer variability. While automated segmentation is a foundational step for quantitative risk assessment, traditional fully supervised convolutional architectures frequently require large amounts of pixel-level annotations and struggle with domain shifts common in medical imaging. In this work, we propose a label-efficient segmentation framework that leverages the robust semantic priors of a pretrained DINOv3 foundational vision transformer backbone. By integrating this backbone with a Dense Prediction Transformer (DPT)-style decoder, our model hierarchically reassembles multi-scale features to combine global semantic representations with fine-grained spatial details. Evaluated on a clinical dataset of 7,777 annotated frames from 112 patients, our method achieves state-of-the-art performance compared to established fully supervised baselines, including U-Net, U-Net++, DeepLabV3, and MAnet. Specifically, we obtain a Dice score of 0.945 and improved boundary adherence, reducing the 95th-percentile Hausdorff Distance by 11.4% relative to the strongest convolutional baseline. Furthermore, we conduct an extensive efficiency analysis demonstrating that our DINOv3-based approach retains significantly higher performance under data starvation regimes, maintaining strong results even when trained on only 25% of the data. These results suggest that leveraging large-scale self-supervised foundations provides a promising and data-efficient solution for medical image segmentation in data-constrained clinical environments. Project Repository: https://github.com/FrancescaFati/MESA
Problem

Research questions and friction points this paper is trying to address.

adnexal mass segmentation
annotation efficiency
medical image segmentation
domain shift
ultrasound imaging
Innovation

Methods, ideas, or system contributions that make the work stand out.

foundation model
label-efficient segmentation
DINOv3
medical image segmentation
self-supervised learning
🔎 Similar Papers
No similar papers found.
F
Francesca Fati
Mayo Clinic
A
Alberto Rota
Politecnico di Milano
A
Adriana V. Gregory
Mayo Clinic
A
Anna Catozzo
Mayo Clinic
M
Maria C. Giuliano
Mayo Clinic
M
Mrinal Dhar
Mayo Clinic
L
Luigi De Vitis
Mayo Clinic
A
Annie T. Packard
Mayo Clinic
F
Francesco Multinu
Istituto Europeo di Oncologia
Elena De Momi
Elena De Momi
Politecnico di Milano
medical roboticscomputer visionartificial intelligencehuman robot interaction
C
Carrie L. Langstraat
Mayo Clinic
T
Timothy L. Kline
Mayo Clinic