Dino U-Net: Exploiting High-Fidelity Dense Features from Foundation Models for Medical Image Segmentation

📅 2025-08-28
🤖 AI Summary
To address the low representation-transfer efficiency of foundation models in medical image segmentation, this paper proposes Dino U-Net, an encoder-decoder architecture built on a frozen DINOv3 vision foundation model backbone. It introduces a Fidelity-Aware Projection Module (FAPM) to preserve discriminative dense-feature information during dimensionality reduction, and a lightweight adapter to fuse high-level semantic cues with low-level spatial details. By avoiding full fine-tuning of the large-scale vision model, Dino U-Net transfers efficiently across imaging modalities. Evaluated on seven mainstream medical image segmentation benchmarks, it consistently outperforms state-of-the-art methods. Notably, its accuracy scales robustly with backbone size, up to the 7-billion-parameter variant, demonstrating for the first time the scalability and effectiveness of ultra-large vision foundation models in medical image segmentation.

📝 Abstract
Foundation models pre-trained on large-scale natural image datasets offer a powerful paradigm for medical image segmentation. However, effectively transferring their learned representations for precise clinical applications remains a challenge. In this work, we propose Dino U-Net, a novel encoder-decoder architecture designed to exploit the high-fidelity dense features of the DINOv3 vision foundation model. Our architecture introduces an encoder built upon a frozen DINOv3 backbone, which employs a specialized adapter to fuse the model's rich semantic features with low-level spatial details. To preserve the quality of these representations during dimensionality reduction, we design a new fidelity-aware projection module (FAPM) that effectively refines and projects the features for the decoder. We conducted extensive experiments on seven diverse public medical image segmentation datasets. Our results show that Dino U-Net achieves state-of-the-art performance, consistently outperforming previous methods across various imaging modalities. Our framework proves to be highly scalable, with segmentation accuracy consistently improving as the backbone model size increases up to the 7-billion-parameter variant. The findings demonstrate that leveraging the superior, dense-pretrained features from a general-purpose foundation model provides a highly effective and parameter-efficient approach to advance the accuracy of medical image segmentation. The code is available at https://github.com/yifangao112/DinoUNet.
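The parameter-efficiency argument in the abstract (frozen backbone, small trainable adapter/projection/decoder) can be illustrated with a minimal, dependency-free sketch. This is not the authors' code: the class, the non-backbone parameter counts, and the split of components are illustrative assumptions; only the 7-billion-parameter backbone size and the frozen-backbone design come from the paper.

```python
# Illustrative sketch of the parameter-efficient training setup described in
# the abstract: the DINOv3 backbone is frozen, while only the adapter, the
# fidelity-aware projection module (FAPM), and the decoder are trained.
# All counts except the 7B backbone size are hypothetical.

class Component:
    def __init__(self, num_params, trainable=True):
        self.num_params = num_params
        self.trainable = trainable

class DinoUNetSketch:
    def __init__(self):
        self.parts = {
            "backbone": Component(7_000_000_000, trainable=False),  # frozen DINOv3-7B
            "adapter":  Component(20_000_000),   # hypothetical size
            "fapm":     Component(5_000_000),    # hypothetical size
            "decoder":  Component(50_000_000),   # hypothetical size
        }

    def trainable_fraction(self):
        total = sum(c.num_params for c in self.parts.values())
        trainable = sum(c.num_params for c in self.parts.values() if c.trainable)
        return trainable / total

model = DinoUNetSketch()
# With these assumed sizes, roughly 1% of all parameters are trained,
# which is why scaling the frozen backbone to 7B stays tractable.
print(f"trainable fraction: {model.trainable_fraction():.4f}")
```

The same accounting explains the scaling result: growing the frozen backbone improves the dense features fed to the decoder without growing the optimization problem.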
Problem

Research questions and friction points this paper is trying to address.

Transferring foundation model features for medical segmentation
Preserving feature fidelity during dimensionality reduction
Achieving scalable accuracy across diverse medical images
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses frozen DINOv3 backbone with adapter
Introduces fidelity-aware projection module
Leverages dense-pretrained foundation model features
Yifan Gao
University of Science and Technology of China, Hefei, China

Haoyue Li
University of Science and Technology of China, Hefei, China

Feng Yuan
Postdoctoral Fellow of Computer Science and Engineering, The Chinese University of Hong Kong
Computer Aided Design, Fault-Tolerant Computing

Xiaosong Wang
Shanghai AI Laboratory
Medical Image Analysis, Computer Vision, Vision and Language

Xin Gao
University of Science and Technology of China, Hefei, China