🤖 AI Summary
This work addresses the heavy reliance on extensive pixel-level annotations and the limited cross-domain generalization of ultrasound adnexal mass segmentation by introducing, for the first time, the self-supervised foundation model DINOv3 into medical image segmentation. By pairing DINOv3 with a Dense Prediction Transformer (DPT) decoder and a multi-scale feature fusion strategy, the proposed method jointly models global semantics and local details while substantially reducing annotation requirements. Evaluated on 7,777 clinical ultrasound frames, the approach achieves a Dice coefficient of 0.945 and reduces the 95% Hausdorff distance by 11.4% compared with the strongest convolutional baseline. Notably, it maintains superior performance even with only 25% of the annotated data, demonstrating improved boundary delineation and strong generalization under limited supervision.
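The summary's "multi-scale feature fusion" refers to DPT-style reassembly of ViT patch tokens into spatial feature maps that are then combined. A minimal numpy sketch of that idea (a schematic stand-in, not the paper's implementation; `grid`, `out_factor`, and the summation fusion are illustrative assumptions):

```python
import numpy as np

def tokens_to_map(tokens, grid):
    """(grid*grid, C) patch tokens -> (C, grid, grid) feature map."""
    c = tokens.shape[1]
    return tokens.reshape(grid, grid, c).transpose(2, 0, 1)

def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling along both spatial axes."""
    return fmap.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse_layers(layer_tokens, grid, out_factor=4):
    """Reassemble tokens from several transformer depths into spatial
    maps, bring them to a common finer resolution, and fuse by sum --
    a toy version of DPT's reassemble + fusion blocks (assumed here)."""
    maps = [upsample_nearest(tokens_to_map(t, grid), out_factor)
            for t in layer_tokens]
    return np.sum(maps, axis=0)

# toy example: 3 layers of a ViT with a 4x4 patch grid, 8 channels
rng = np.random.default_rng(0)
layers = [rng.standard_normal((16, 8)) for _ in range(3)]
fused = fuse_layers(layers, grid=4, out_factor=4)
print(fused.shape)  # -> (8, 16, 16)
```

In the real decoder, learned convolutions and progressive coarse-to-fine fusion replace the nearest-neighbour upsampling and plain summation used here.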
📝 Abstract
Adnexal mass evaluation via ultrasound is a challenging clinical task, often hindered by subjective interpretation and significant inter-observer variability. While automated segmentation is a foundational step for quantitative risk assessment, traditional fully supervised convolutional architectures frequently require large amounts of pixel-level annotations and struggle with the domain shifts common in medical imaging. In this work, we propose a label-efficient segmentation framework that leverages the robust semantic priors of a pretrained DINOv3 foundation vision transformer backbone. By integrating this backbone with a Dense Prediction Transformer (DPT)-style decoder, our model hierarchically reassembles multi-scale features to combine global semantic representations with fine-grained spatial details. Evaluated on a clinical dataset of 7,777 annotated frames from 112 patients, our method achieves state-of-the-art performance compared to established fully supervised baselines, including U-Net, U-Net++, DeepLabV3, and MAnet. Specifically, we obtain a Dice score of 0.945 and improved boundary adherence, reducing the 95th-percentile Hausdorff distance by 11.4% relative to the strongest convolutional baseline. Furthermore, we conduct an extensive efficiency analysis demonstrating that our DINOv3-based approach retains significantly higher performance in data-starved regimes, maintaining strong results even when trained on only 25% of the training data. These results suggest that large-scale self-supervised foundation models provide a promising and data-efficient solution for medical image segmentation in data-constrained clinical environments. Project Repository: https://github.com/FrancescaFati/MESA
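For readers less familiar with the two reported metrics, here is a minimal numpy sketch of the Dice score and the 95th-percentile Hausdorff distance on binary masks (a brute-force illustration; the toy masks and function names are our own, and production pipelines typically use distance transforms or libraries such as MONAI instead):

```python
import numpy as np

def dice(pred, gt):
    """Dice coefficient between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    return 2.0 * inter / denom if denom else 1.0

def hd95(pred, gt):
    """95th-percentile symmetric Hausdorff distance in pixels.
    Brute-force pairwise distances: fine for small masks only."""
    a = np.argwhere(pred.astype(bool))
    b = np.argwhere(gt.astype(bool))
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    fwd = d.min(axis=1)   # each predicted point -> nearest gt point
    bwd = d.min(axis=0)   # each gt point -> nearest predicted point
    return np.percentile(np.concatenate([fwd, bwd]), 95)

# toy case: ground-truth 8x8 square vs. prediction shifted down one row
gt = np.zeros((16, 16), bool); gt[4:12, 4:12] = True
pred = np.zeros((16, 16), bool); pred[5:13, 4:12] = True
print(dice(pred, gt))  # -> 0.875
print(hd95(pred, gt))  # -> 1.0
```

Unlike Dice, which rewards bulk overlap, the 95th-percentile Hausdorff distance penalizes outlying boundary errors while discarding the worst 5%, which is why the paper reports both.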