🤖 AI Summary
To address the difficulty of efficiently transferring foundation-model representations to medical image segmentation, this paper proposes Dino U-Net, an encoder-decoder architecture built on a frozen DINOv3 vision foundation model backbone. It introduces a Fidelity-Aware Projection Module (FAPM) to preserve discriminative dense feature information during dimensionality reduction and adds a lightweight adapter that fuses high-level semantic cues with low-level spatial details. By avoiding full fine-tuning of the large-scale vision model, Dino U-Net remains parameter-efficient while transferring effectively to medical imaging. Evaluated on seven public medical image segmentation benchmarks spanning diverse modalities, it consistently outperforms state-of-the-art methods. Notably, its accuracy scales with backbone size up to the 7-billion-parameter variant, demonstrating that ultra-large vision foundation models can be leveraged effectively for medical image segmentation.
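To make the wiring implied by this summary concrete, below is a minimal, hypothetical PyTorch sketch, not the authors' implementation. `FrozenBackboneStub`, `FAPM`, and `DinoUNetSketch`, along with every layer choice and dimension, are assumptions standing in for the frozen DINOv3 encoder, the fidelity-aware projection, and the adapter/decoder described above; the actual design is in the paper and its repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrozenBackboneStub(nn.Module):
    """Stand-in for the frozen DINOv3 ViT encoder (an assumption for this sketch).

    A single patch-embedding convolution keeps the example self-contained;
    in practice this would be the pre-trained DINOv3 model with all weights frozen.
    """
    def __init__(self, patch: int = 16, dim: int = 1024):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        for p in self.parameters():
            p.requires_grad = False  # backbone stays frozen, as in the paper's setup

    def forward(self, x):
        # (B, 3, H, W) -> (B, dim, H/patch, W/patch) dense patch features
        return self.proj(x)


class FAPM(nn.Module):
    """Hypothetical fidelity-aware projection: reduce channel dimension, then
    refine the projected dense features with a lightweight residual branch."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.reduce = nn.Conv2d(in_dim, out_dim, kernel_size=1)
        self.refine = nn.Sequential(
            nn.Conv2d(out_dim, out_dim, kernel_size=3, padding=1, groups=out_dim),
            nn.GELU(),
            nn.Conv2d(out_dim, out_dim, kernel_size=1),
        )

    def forward(self, feats):
        projected = self.reduce(feats)
        return projected + self.refine(projected)  # residual refinement


class DinoUNetSketch(nn.Module):
    """Minimal encoder-decoder wiring: frozen backbone -> FAPM -> fusion with a
    trainable low-level spatial path -> shallow decoder -> segmentation logits."""
    def __init__(self, num_classes: int = 2, backbone_dim: int = 1024, dec_dim: int = 256):
        super().__init__()
        self.backbone = FrozenBackboneStub(dim=backbone_dim)
        self.low_level = nn.Sequential(  # trainable spatial-detail path
            nn.Conv2d(3, dec_dim, kernel_size=3, stride=4, padding=1),
            nn.GELU(),
        )
        self.fapm = FAPM(backbone_dim, dec_dim)
        self.fuse = nn.Conv2d(2 * dec_dim, dec_dim, kernel_size=1)
        self.decoder = nn.Sequential(
            nn.Conv2d(dec_dim, dec_dim, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(dec_dim, num_classes, kernel_size=1),
        )

    def forward(self, x):
        deep = self.fapm(self.backbone(x))   # projected semantic features
        shallow = self.low_level(x)          # low-level spatial details
        deep = F.interpolate(deep, size=shallow.shape[-2:],
                             mode="bilinear", align_corners=False)
        logits = self.decoder(self.fuse(torch.cat([deep, shallow], dim=1)))
        return F.interpolate(logits, size=x.shape[-2:],
                             mode="bilinear", align_corners=False)


if __name__ == "__main__":
    model = DinoUNetSketch(num_classes=2)
    out = model(torch.randn(1, 3, 256, 256))
    print(out.shape)  # torch.Size([1, 2, 256, 256])
```

Only the projection, fusion, and decoder parameters are trainable here, which is the property that makes the frozen-backbone setup parameter-efficient even when the backbone itself is very large.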
📝 Abstract
Foundation models pre-trained on large-scale natural image datasets offer a powerful paradigm for medical image segmentation. However, effectively transferring their learned representations for precise clinical applications remains a challenge. In this work, we propose Dino U-Net, a novel encoder-decoder architecture designed to exploit the high-fidelity dense features of the DINOv3 vision foundation model. Our architecture introduces an encoder built upon a frozen DINOv3 backbone, which employs a specialized adapter to fuse the model's rich semantic features with low-level spatial details. To preserve the quality of these representations during dimensionality reduction, we design a new fidelity-aware projection module (FAPM) that effectively refines and projects the features for the decoder. We conducted extensive experiments on seven diverse public medical image segmentation datasets. Our results show that Dino U-Net achieves state-of-the-art performance, consistently outperforming previous methods across various imaging modalities. Our framework proves to be highly scalable, with segmentation accuracy consistently improving as the backbone model size increases up to the 7-billion-parameter variant. The findings demonstrate that leveraging the superior, dense-pretrained features from a general-purpose foundation model provides a highly effective and parameter-efficient approach to advance the accuracy of medical image segmentation. The code is available at https://github.com/yifangao112/DinoUNet.