DINOv2-powered Few-Shot Semantic Segmentation: A Unified Framework via Cross-Model Distillation and 4D Correlation Mining

📅 2025-04-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses few-shot semantic segmentation—pixel-level segmentation of novel classes using only a few annotated images. We propose a unified, lightweight framework built solely upon the DINOv2 encoder. Our method introduces three key innovations: (1) a novel coarse-to-fine cross-model distillation mechanism that transfers segmentation priors from SAM into the DINOv2 feature space; (2) a meta-visual prompt generator leveraging dense similarity matching and semantic embedding; and (3) 4D correlation modeling over support-query image pairs to enhance cross-image matching fidelity. Integrated with a bottleneck adapter and a lightweight decoder, our approach achieves state-of-the-art performance on COCO-20i, PASCAL-5i, and FSS-1000, surpassing prior methods in accuracy while using significantly fewer parameters. This demonstrates the efficacy and strong generalization capability of single-foundation-model-driven few-shot segmentation.

📝 Abstract
Few-shot semantic segmentation has gained increasing interest due to its generalization capability, i.e., segmenting pixels of novel classes requiring only a few annotated images. Prior work has focused on meta-learning for support-query matching, with extensive development in both prototype-based and aggregation-based methods. To address data scarcity, recent approaches have turned to foundation models to enhance representation transferability for novel class segmentation. Among them, a hybrid dual-modal framework including both DINOv2 and SAM has garnered attention due to their complementary capabilities. We wonder, "can we build a unified model with knowledge from both foundation models?" To this end, we propose FS-DINO, with only DINOv2's encoder and a lightweight segmenter. The segmenter features a bottleneck adapter, a meta-visual prompt generator based on dense similarities and semantic embeddings, and a decoder. Through coarse-to-fine cross-model distillation, we effectively integrate SAM's knowledge into our lightweight segmenter, which can be further enhanced by 4D correlation mining on support-query pairs. Extensive experiments on COCO-20i, PASCAL-5i, and FSS-1000 demonstrate the effectiveness and superiority of our method.
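The abstract mentions dense similarities and 4D correlation on support-query pairs without giving formulas. As a rough illustration only, the sketch below shows one common way such a correlation tensor is built in few-shot segmentation: cosine similarity between every query location and every support location, masked by the support annotation. The function names, the ReLU-style clipping, and the max-pooled prior are assumptions made for this example, not the paper's confirmed design.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale feature vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def correlation_4d(query_feat, support_feat, support_mask):
    """Hypothetical helper: dense 4D correlation between two feature maps.

    query_feat:   (Hq, Wq, C) patch features of the query image
    support_feat: (Hs, Ws, C) patch features of the support image
    support_mask: (Hs, Ws) binary foreground mask at feature resolution
    returns:      (Hq, Wq, Hs, Ws) non-negative cosine similarities
    """
    q = l2_normalize(query_feat)
    s = l2_normalize(support_feat)
    # Cosine similarity for every (query location, support location) pair.
    corr = np.einsum("ijc,klc->ijkl", q, s)
    # Assumption: negative similarities are clipped to zero.
    corr = np.clip(corr, 0.0, None)
    # Keep only correlations with annotated foreground support locations.
    return corr * support_mask[None, None, :, :]

def coarse_prior(corr):
    """Hypothetical helper: best foreground match per query location."""
    hq, wq = corr.shape[:2]
    return corr.reshape(hq, wq, -1).max(axis=-1)
```

In this sketch the 4D tensor keeps the full spatial layout of both images, so a later module could process it with learned layers, while `coarse_prior` collapses it into a single query-sized map that could serve as a rough foreground hint.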
Problem

Research questions and friction points this paper is trying to address.

Unify DINOv2 and SAM for few-shot segmentation
Enhance segmentation via cross-model distillation
Improve accuracy with 4D correlation mining
Innovation

Methods, ideas, or system contributions that make the work stand out.

DINOv2 encoder with lightweight segmenter
Cross-model distillation from SAM
4D correlation mining enhancement
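The lightweight segmenter is described as containing a bottleneck adapter, but the page does not specify its layout. The sketch below assumes the widely used form (down-projection, nonlinearity, up-projection, residual connection) to show why such a module adds few parameters. The class name, the ReLU activation, the zero-initialized up-projection, and the sizes 768 and 64 are all illustrative assumptions, not values taken from the paper.

```python
import numpy as np

class BottleneckAdapter:
    """Hypothetical residual adapter: dim -> hidden -> dim, with hidden << dim."""

    def __init__(self, dim, hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.w_down = rng.normal(scale=0.02, size=(dim, hidden))
        self.b_down = np.zeros(hidden)
        # Assumption: zero-initialized up-projection, so the adapter starts
        # as an identity mapping and leaves the frozen features unchanged.
        self.w_up = np.zeros((hidden, dim))
        self.b_up = np.zeros(dim)

    def __call__(self, tokens):
        """tokens: (N, dim) patch tokens; returns adapted tokens of the same shape."""
        h = np.maximum(tokens @ self.w_down + self.b_down, 0.0)  # ReLU
        return tokens + h @ self.w_up + self.b_up

    def num_params(self):
        """Total number of trainable scalars in the adapter."""
        return sum(p.size for p in (self.w_down, self.b_down, self.w_up, self.b_up))
```

With the assumed sizes `dim=768` and `hidden=64`, the adapter holds 99,136 parameters, compared with 590,592 for a single full 768-by-768 linear layer with bias, which is the usual argument for the bottleneck design.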
Wei Zhuo
(1) School of Artificial Intelligence, Shenzhen University, Shenzhen 518060, China; (2) National Engineering Laboratory of Big Data System Computing Technology, Shenzhen University
Zhiyue Tang
School of Artificial Intelligence, Shenzhen University, Shenzhen 518060, China
Wufeng Xue
Shenzhen University; Xi'an Jiaotong University; University of Western Ontario
medical image analysis, computer vision, image processing, image quality assessment
Hao Ding
School of Artificial Intelligence, Shenzhen University, Shenzhen 518060, China
Linlin Shen
Shenzhen University
Deep Learning, Computer Vision, Facial Analysis/Recognition, Medical Image Analysis