DINOv2-powered Few-Shot Semantic Segmentation: A Unified Framework via Cross-Model Distillation and 4D Correlation Mining

📅 2025-04-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses few-shot semantic segmentation—pixel-level segmentation of novel classes using only a few annotated images. We propose a unified, lightweight framework built solely upon the DINOv2 encoder. Our method introduces three key innovations: (1) a novel coarse-to-fine cross-model distillation mechanism that transfers segmentation priors from SAM into the DINOv2 feature space; (2) a meta-visual prompt generator leveraging dense similarity matching and semantic embedding; and (3) 4D correlation modeling over support-query image pairs to enhance cross-image matching fidelity. Integrated with a bottleneck adapter and a lightweight decoder, our approach achieves state-of-the-art performance on COCO-20i, PASCAL-5i, and FSS-1000, surpassing prior methods in accuracy while using significantly fewer parameters. This demonstrates the efficacy and strong generalization capability of single-foundation-model-driven few-shot segmentation.
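To make the "4D correlation" idea concrete, here is a minimal NumPy sketch, assuming it means the cosine similarity between every query location and every support location. The function name `correlation_4d` and the toy shapes are hypothetical, and this is not the authors' implementation, which is not shown on this page.

```python
import numpy as np

def correlation_4d(query_feat, support_feat, eps=1e-8):
    """Cosine similarity between all query and all support locations.

    query_feat:   array of shape (C, Hq, Wq)
    support_feat: array of shape (C, Hs, Ws)
    returns:      array of shape (Hq, Wq, Hs, Ws)
    """
    c, hq, wq = query_feat.shape
    c2, hs, ws = support_feat.shape
    if c != c2:
        raise ValueError("channel dimensions must match")

    # Flatten the spatial grid so each column is one location's feature vector.
    q = query_feat.reshape(c, hq * wq)
    s = support_feat.reshape(c, hs * ws)

    # L2-normalize each column so the dot product becomes a cosine similarity.
    q = q / (np.linalg.norm(q, axis=0, keepdims=True) + eps)
    s = s / (np.linalg.norm(s, axis=0, keepdims=True) + eps)

    corr = q.T @ s  # (Hq*Wq, Hs*Ws)
    return corr.reshape(hq, wq, hs, ws)

# Toy usage with random stand-ins for encoder features (assumed shapes).
rng = np.random.default_rng(0)
query = rng.standard_normal((8, 4, 5))
support = rng.standard_normal((8, 3, 6))
corr = correlation_4d(query, support)
```

The result has one 2D similarity map over the support image for each query location, which is why the tensor has four spatial axes.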

📝 Abstract

Few-shot semantic segmentation has gained increasing interest due to its generalization capability, i.e., segmenting pixels of novel classes from only a few annotated images. Prior work has focused on meta-learning for support-query matching, with extensive development in both prototype-based and aggregation-based methods. To address data scarcity, recent approaches have turned to foundation models to enhance representation transferability for novel class segmentation. Among them, a hybrid dual-modal framework combining DINOv2 and SAM has garnered attention due to the complementary capabilities of the two models. We ask: "Can we build a unified model with knowledge from both foundation models?" To this end, we propose FS-DINO, which uses only DINOv2's encoder and a lightweight segmenter. The segmenter features a bottleneck adapter, a meta-visual prompt generator based on dense similarities and semantic embeddings, and a decoder. Through coarse-to-fine cross-model distillation, we effectively integrate SAM's knowledge into our lightweight segmenter, which can be further enhanced by 4D correlation mining on support-query pairs. Extensive experiments on COCO-20i, PASCAL-5i, and FSS-1000 demonstrate the effectiveness and superiority of our method.
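The abstract does not spell out how dense similarities feed the prompt generator. As a rough illustration only, the sketch below computes a common kind of similarity prior: for each query location, the highest cosine similarity to any foreground support location. The name `similarity_prior`, the min-max rescaling, and the toy data are assumptions for this example, not details taken from FS-DINO.

```python
import numpy as np

def similarity_prior(query_feat, support_feat, support_mask, eps=1e-8):
    """Per-location prior for the query image, rescaled to [0, 1].

    query_feat:   (C, Hq, Wq) query features
    support_feat: (C, Hs, Ws) support features
    support_mask: (Hs, Ws) foreground mask for the support image
    """
    c, hq, wq = query_feat.shape
    fg = np.asarray(support_mask).astype(bool).reshape(-1)
    if not fg.any():
        raise ValueError("support mask has no foreground pixels")

    q = query_feat.reshape(c, -1)
    s = support_feat.reshape(c, -1)[:, fg]  # keep foreground columns only

    # Normalize columns so dot products are cosine similarities.
    q = q / (np.linalg.norm(q, axis=0, keepdims=True) + eps)
    s = s / (np.linalg.norm(s, axis=0, keepdims=True) + eps)

    sim = q.T @ s              # (Hq*Wq, number of foreground pixels)
    prior = sim.max(axis=1)    # best foreground match per query location

    # Min-max rescale (an assumption made here for readability of the map).
    prior = (prior - prior.min()) / (prior.max() - prior.min() + eps)
    return prior.reshape(hq, wq)

# Toy usage: plant one foreground support vector inside the query features.
rng = np.random.default_rng(1)
support = rng.standard_normal((4, 2, 2))
mask = np.array([[1, 0], [0, 0]])
query = rng.standard_normal((4, 3, 3))
query[:, 1, 2] = support[:, 0, 0]
prior = similarity_prior(query, support, mask)
```

In the toy data, the planted location `(1, 2)` receives the highest prior value because it matches the single foreground support vector exactly.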

Problem

Research questions and friction points this paper is trying to address.

Unify DINOv2 and SAM for few-shot segmentation

Enhance segmentation via cross-model distillation

Improve accuracy with 4D correlation mining

Innovation

Methods, ideas, or system contributions that make the work stand out.

DINOv2 encoder with lightweight segmenter

Cross-model distillation from SAM

4D correlation mining enhancement
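The distillation item above can be illustrated with a generic teacher-student objective. The sketch assumes the teacher (here standing in for SAM) provides a soft mask and the student provides mask logits, and it uses plain binary cross-entropy. The coarse-to-fine schedule mentioned in the abstract is not described on this page, so it is not modeled; `distill_loss` is a hypothetical name.

```python
import numpy as np

def distill_loss(student_logits, teacher_probs, eps=1e-7):
    """Binary cross-entropy between student predictions and a teacher soft mask.

    student_logits: (H, W) raw mask logits from the student
    teacher_probs:  (H, W) teacher mask probabilities in [0, 1]
    """
    student_logits = np.asarray(student_logits, dtype=np.float64)
    teacher_probs = np.asarray(teacher_probs, dtype=np.float64)

    # Sigmoid, then clip so the logarithms stay finite.
    p = 1.0 / (1.0 + np.exp(-student_logits))
    p = np.clip(p, eps, 1.0 - eps)

    bce = -(teacher_probs * np.log(p) + (1.0 - teacher_probs) * np.log(1.0 - p))
    return float(bce.mean())

# Toy usage: a student that agrees with the teacher versus one that does not.
teacher = np.array([[1.0, 0.0], [0.0, 1.0]])
agree = np.array([[10.0, -10.0], [-10.0, 10.0]])
loss_agree = distill_loss(agree, teacher)
loss_disagree = distill_loss(-agree, teacher)
```

The loss is near zero when the student reproduces the teacher's mask and grows large when the predictions are inverted, which is the signal a distillation step would minimize.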

🔎 Similar Papers

A Novel Benchmark for Few-Shot Semantic Segmentation in the Era of Foundation Models