MedDINOv3: How to adapt vision foundation models for medical image segmentation?

📅 2025-09-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Task-specific models for medical image segmentation exhibit poor cross-modal and cross-institutional generalization. Method: This paper proposes a vision foundation model (VFM) adaptation framework tailored for medical imaging. It introduces (1) a multi-scale token aggregation architecture to enhance local-global feature fusion; (2) CT-3M, a large-scale CT dataset, and a multi-stage domain-adaptive pretraining strategy based on DINOv3 to bridge the domain gap between natural and medical images; and (3) a ViT backbone integrated with self-supervised and dense feature learning to improve medical image representation quality. Results: The method achieves state-of-the-art or competitive performance on four major medical segmentation benchmarks. It is the first work to systematically demonstrate the feasibility and superiority of generic vision foundation models as unified backbones for medical image segmentation.
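The summary's "multi-scale token aggregation" refers to fusing token features drawn from several intermediate ViT layers rather than using only the final layer. The paper does not spell out the fusion operator here, so the following is a minimal NumPy sketch under one common assumption: concatenate per-layer token features along the channel axis and project back to the original dimension (the random projection stands in for a learned linear layer). The function name `multiscale_token_aggregation` is hypothetical.

```python
import numpy as np

def multiscale_token_aggregation(layer_tokens, rng=None):
    """Fuse token features from several ViT layers into one dense map.

    layer_tokens: list of arrays, each (num_tokens, dim) -- outputs of
    selected intermediate transformer blocks. Channels are concatenated,
    then a linear projection (random here, learned in practice) maps the
    fused features back to `dim`.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    stacked = np.concatenate(layer_tokens, axis=-1)          # (N, L*dim)
    dim = layer_tokens[0].shape[-1]
    proj = rng.standard_normal((stacked.shape[-1], dim))
    proj /= np.sqrt(stacked.shape[-1])                       # keep scale stable
    return stacked @ proj                                    # (N, dim)

# Toy example: 4 intermediate layers, 196 patch tokens, 768-dim features
tokens = [np.ones((196, 768)) for _ in range(4)]
fused = multiscale_token_aggregation(tokens)
print(fused.shape)  # (196, 768)
```

Aggregating intermediate layers this way is what lets a plain ViT expose both local (early-layer) and global (late-layer) structure to a segmentation head.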

📝 Abstract
Accurate segmentation of organs and tumors in CT and MRI scans is essential for diagnosis, treatment planning, and disease monitoring. While deep learning has advanced automated segmentation, most models remain task-specific, lacking generalizability across modalities and institutions. Vision foundation models (FMs) pretrained on billion-scale natural images offer powerful and transferable representations. However, adapting them to medical imaging faces two key challenges: (1) the ViT backbone of most foundation models still underperforms specialized CNNs on medical image segmentation, and (2) the large domain gap between natural and medical images limits transferability. We introduce MedDINOv3, a simple and effective framework for adapting DINOv3 to medical segmentation. We first revisit plain ViTs and design a simple and effective architecture with multi-scale token aggregation. Then, we perform domain-adaptive pretraining on CT-3M, a curated collection of 3.87M axial CT slices, using a multi-stage DINOv3 recipe to learn robust dense features. MedDINOv3 matches or exceeds state-of-the-art performance across four segmentation benchmarks, demonstrating the potential of vision foundation models as unified backbones for medical image segmentation. The code is available at https://github.com/ricklisz/MedDINOv3.
Problem

Research questions and friction points this paper is trying to address.

Adapting vision foundation models for medical image segmentation
Addressing domain gap between natural and medical images
Improving ViT backbone performance for CT and MRI scans
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapts DINOv3 foundation model
Uses multi-scale token aggregation
Domain-adaptive pretraining on CT-3M dataset
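The domain-adaptive pretraining follows a DINOv3-style self-supervised recipe, whose core mechanic is a teacher-student pair where only the student receives gradients and the teacher's weights track the student via an exponential moving average (EMA). A minimal NumPy sketch of that update, with the momentum value `0.996` chosen as a typical illustration rather than the paper's setting:

```python
import numpy as np

def ema_update(teacher, student, momentum=0.996):
    """DINO-style teacher update: each teacher parameter drifts toward the
    corresponding student parameter via an exponential moving average.
    The teacher is never trained by backprop directly."""
    return {name: momentum * teacher[name] + (1.0 - momentum) * student[name]
            for name in teacher}

# Toy parameters: the teacher starts at 1.0, the student at 0.0
student = {"w": np.zeros(3)}
teacher = {"w": np.ones(3)}
teacher = ema_update(teacher, student)
print(teacher["w"])  # each entry = 0.996
```

In the full recipe this update runs every training step, so the teacher provides a slowly evolving, stable target for the student's predictions on the CT-3M slices.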
Yuheng Li
Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta
Yizhou Wu
Department of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta
Yuxiang Lai
Ph.D. Student in Computer Science, Emory University
Computer Vision, Medical Imaging
Mingzhe Hu
Department of Computer Science, Emory University, Atlanta
Xiaofeng Yang
Department of Computer Science, Emory University, Atlanta and Department of Radiation Oncology, Emory University School of Medicine, Atlanta