Does DINOv3 Set a New Medical Vision Standard?

📅 2025-09-08
🤖 AI Summary
Medical vision models often rely on domain-specific architectures and pretraining, limiting generalizability and increasing development overhead. Method: This work systematically evaluates DINOv3, a state-of-the-art self-supervised ViT pretrained on natural images, as a generic encoder for multimodal medical vision tasks, including 2D/3D classification and segmentation, without any medical-domain pretraining. Experiments span major modalities (CT, MRI, X-ray) and specialized domains (whole-slide pathology images, electron microscopy, and PET). Contribution/Results: DINOv3 establishes new state-of-the-art performance across most benchmarks, surpassing biomedical-specialized models (e.g., BiomedCLIP, CT-Net). It also reveals, for the first time, non-uniform scaling behavior and feature degradation in medical imaging, both critical insights for adaptation, and it demonstrates strong transferability to emerging tasks such as 3D reconstruction. Collectively, this study validates generic vision foundation models as robust, off-the-shelf encoders, advocating a paradigm shift from task-specific design to "pretrain, then lightweight adaptation" in medical vision research.

📝 Abstract
The advent of large-scale vision foundation models, pre-trained on diverse natural images, has marked a paradigm shift in computer vision. However, how well the efficacy of frontier vision foundation models transfers to specialized domains such as medical imaging remains an open question. This report investigates whether DINOv3, a state-of-the-art self-supervised vision transformer (ViT) with strong capabilities in dense prediction tasks, can directly serve as a powerful, unified encoder for medical vision tasks without domain-specific pre-training. To answer this, we benchmark DINOv3 across common medical vision tasks, including 2D/3D classification and segmentation, on a wide range of medical imaging modalities. We systematically analyze its scalability by varying model sizes and input image resolutions. Our findings reveal that DINOv3 shows impressive performance and establishes a formidable new baseline. Remarkably, it can even outperform medical-specific foundation models like BiomedCLIP and CT-Net on several tasks, despite being trained solely on natural images. However, we identify clear limitations: the model's features degrade in scenarios requiring deep domain specialization, such as whole-slide pathological images (WSIs), electron microscopy (EM), and positron emission tomography (PET). Furthermore, we observe that DINOv3 does not consistently obey scaling laws in the medical domain; performance does not reliably increase with larger models or finer feature resolutions, showing diverse scaling behaviors across tasks. Ultimately, our work establishes DINOv3 as a strong baseline whose powerful visual features can serve as a robust prior for multiple complex medical tasks. This opens promising future directions, such as leveraging its features to enforce multiview consistency in 3D reconstruction.
Problem

Research questions and friction points this paper is trying to address.

Evaluating DINOv3's transfer to medical imaging tasks
Assessing performance without domain-specific pre-training
Identifying limitations in specialized medical domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses DINOv3 as a unified medical image encoder
Benchmarks across 2D/3D classification and segmentation tasks
Analyzes scalability across model sizes and input resolutions
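The "pretrain, then lightweight adaptation" recipe the paper advocates amounts to freezing the encoder and training only a small head on its features. The sketch below illustrates that protocol with a linear probe (softmax regression) trained by plain gradient descent. It is a minimal, self-contained illustration, not the paper's code: synthetic random vectors stand in for frozen DINOv3 embeddings (no encoder is loaded), and `feat_dim = 384` mirrors a ViT-S embedding size but is an assumption here.

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, n_train, n_classes = 384, 200, 3  # ViT-S-sized embedding (assumption)

# Stand-ins for frozen-encoder outputs: one feature vector per image.
feats = rng.normal(size=(n_train, feat_dim))
labels = rng.integers(0, n_classes, size=n_train)
onehot = np.eye(n_classes)[labels]

# Lightweight adaptation: train only a linear classifier on the fixed features.
W = np.zeros((feat_dim, n_classes))
b = np.zeros(n_classes)

for _ in range(300):  # plain gradient descent on the cross-entropy loss
    logits = feats @ W + b
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    grad = (probs - onehot) / n_train             # d(loss)/d(logits)
    W -= 0.5 * feats.T @ grad
    b -= 0.5 * grad.sum(axis=0)

train_acc = ((feats @ W + b).argmax(axis=1) == labels).mean()
print(f"linear-probe train accuracy: {train_acc:.2f}")
```

In the paper's setting the same head would sit on real DINOv3 features, so benchmark differences reflect feature quality rather than head capacity; swapping the probe for a small decoder gives the segmentation variant of the protocol.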
Che Liu · Imperial College London · Multimodal Learning, AI4Medicine
Yinda Chen · University of Science and Technology of China, Xiamen University · Machine Learning Theory, Self-supervised Learning, Image Compression
Haoyuan Shi · University of Science and Technology of China
Jinpeng Lu · University of Science and Technology of China · Biomedical Image Processing, Multimodal Learning
Bailiang Jian · Technical University of Munich · Medical Image Registration
Jiazhen Pan · Technical University of Munich · Machine Learning, Medical Image Computing, Biomedical Image Analysis
Linghan Cai · Dresden University of Technology
Jiayi Wang · University of Erlangen-Nuremberg
Yundi Zhang · Technical University of Munich · Computer Vision, Medical Imaging, MRI
Jun Li · Technical University of Munich (TUM), Munich Center for Machine Learning
Cosmin I. Bercea · Technical University of Munich · Computer Vision, Multimodal Learning, Generative AI, Anomaly Detection, Medical Image Analysis
Cheng Ouyang · University of Oxford · Cardiovascular Imaging, Medical Imaging Computing
Chen Chen · University of Sheffield
Zhiwei Xiong · University of Science and Technology of China · Computational Photography, Biomedical Image Analysis
Benedikt Wiestler · Technical University of Munich (TUM), Munich Center for Machine Learning
Christian Wachinger · Technical University of Munich · AI in Medical Imaging, Geometric Deep Learning, Causal Inference, Multi-Modal Diagnostics
Daniel Rueckert · Technical University of Munich and Imperial College London · Machine Learning, Medical Image Computing, Biomedical Image Analysis, Computer Vision
Wenjia Bai · Imperial College London
Rossella Arcucci · Associate Professor, Imperial College London · AI4Good, Data Learning, Data Assimilation, Machine Learning, Deep Learning