🤖 AI Summary
To address the difficulty of efficiently transferring foundation-model representations to medical image segmentation, this paper proposes Dino U-Net, an encoder-decoder architecture built on a frozen DINOv3 vision foundation model backbone. It introduces a Fidelity-Aware Projection Module (FAPM) to preserve discriminative dense feature information during dimensionality reduction and adds a lightweight adapter that fuses high-level semantic cues with low-level spatial details. By avoiding full fine-tuning of the large-scale vision model, Dino U-Net remains parameter-efficient while transferring effectively to medical imaging. Evaluated on seven public medical image segmentation benchmarks spanning diverse modalities, it consistently outperforms state-of-the-art methods. Notably, its accuracy scales with backbone size up to the 7-billion-parameter variant, demonstrating that ultra-large vision foundation models can be leveraged effectively for medical image segmentation.
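To make the wiring implied by this summary concrete, below is a minimal, hypothetical PyTorch sketch, not the authors' implementation. `FrozenBackboneStub`, `FAPM`, and `DinoUNetSketch`, along with every layer choice and dimension, are assumptions standing in for the frozen DINOv3 encoder, the fidelity-aware projection, and the adapter/decoder described above; the actual design is in the paper and its repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrozenBackboneStub(nn.Module):
    """Stand-in for the frozen DINOv3 ViT encoder (an assumption for this sketch).

    A single patch-embedding convolution keeps the example self-contained;
    in practice this would be the pre-trained DINOv3 model with all weights frozen.
    """
    def __init__(self, patch: int = 16, dim: int = 1024):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        for p in self.parameters():
            p.requires_grad = False  # backbone stays frozen, as in the paper's setup

    def forward(self, x):
        # (B, 3, H, W) -> (B, dim, H/patch, W/patch) dense patch features
        return self.proj(x)


class FAPM(nn.Module):
    """Hypothetical fidelity-aware projection: reduce channel dimension, then
    refine the projected dense features with a lightweight residual branch."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.reduce = nn.Conv2d(in_dim, out_dim, kernel_size=1)
        self.refine = nn.Sequential(
            nn.Conv2d(out_dim, out_dim, kernel_size=3, padding=1, groups=out_dim),
            nn.GELU(),
            nn.Conv2d(out_dim, out_dim, kernel_size=1),
        )

    def forward(self, feats):
        projected = self.reduce(feats)
        return projected + self.refine(projected)  # residual refinement


class DinoUNetSketch(nn.Module):
    """Minimal encoder-decoder wiring: frozen backbone -> FAPM -> fusion with a
    trainable low-level spatial path -> shallow decoder -> segmentation logits."""
    def __init__(self, num_classes: int = 2, backbone_dim: int = 1024, dec_dim: int = 256):
        super().__init__()
        self.backbone = FrozenBackboneStub(dim=backbone_dim)
        self.low_level = nn.Sequential(  # trainable spatial-detail path
            nn.Conv2d(3, dec_dim, kernel_size=3, stride=4, padding=1),
            nn.GELU(),
        )
        self.fapm = FAPM(backbone_dim, dec_dim)
        self.fuse = nn.Conv2d(2 * dec_dim, dec_dim, kernel_size=1)
        self.decoder = nn.Sequential(
            nn.Conv2d(dec_dim, dec_dim, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(dec_dim, num_classes, kernel_size=1),
        )

    def forward(self, x):
        deep = self.fapm(self.backbone(x))   # projected semantic features
        shallow = self.low_level(x)          # low-level spatial details
        deep = F.interpolate(deep, size=shallow.shape[-2:],
                             mode="bilinear", align_corners=False)
        logits = self.decoder(self.fuse(torch.cat([deep, shallow], dim=1)))
        return F.interpolate(logits, size=x.shape[-2:],
                             mode="bilinear", align_corners=False)


if __name__ == "__main__":
    model = DinoUNetSketch(num_classes=2)
    out = model(torch.randn(1, 3, 256, 256))
    print(out.shape)  # torch.Size([1, 2, 256, 256])
```

Only the projection, fusion, and decoder parameters are trainable here, which is the property that makes the frozen-backbone setup parameter-efficient even when the backbone itself is very large.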
📝 Abstract
Foundation models pre-trained on large-scale natural image datasets offer a powerful paradigm for medical image segmentation. However, effectively transferring their learned representations for precise clinical applications remains a challenge. In this work, we propose Dino U-Net, a novel encoder-decoder architecture designed to exploit the high-fidelity dense features of the DINOv3 vision foundation model. Our architecture introduces an encoder built upon a frozen DINOv3 backbone, which employs a specialized adapter to fuse the model's rich semantic features with low-level spatial details. To preserve the quality of these representations during dimensionality reduction, we design a new fidelity-aware projection module (FAPM) that effectively refines and projects the features for the decoder. We conducted extensive experiments on seven diverse public medical image segmentation datasets. Our results show that Dino U-Net achieves state-of-the-art performance, consistently outperforming previous methods across various imaging modalities. Our framework proves to be highly scalable, with segmentation accuracy consistently improving as the backbone model size increases up to the 7-billion-parameter variant. The findings demonstrate that leveraging the superior, dense-pretrained features from a general-purpose foundation model provides a highly effective and parameter-efficient approach to advance the accuracy of medical image segmentation. The code is available at https://github.com/yifangao112/DinoUNet.