🤖 AI Summary
This work addresses few-shot semantic segmentation: pixel-level segmentation of novel classes using only a few annotated images. We propose a unified, lightweight framework built solely upon the DINOv2 encoder. Our method introduces three key innovations: (1) a coarse-to-fine cross-model distillation mechanism that transfers segmentation priors from SAM into the DINOv2 feature space; (2) a meta-visual prompt generator that leverages dense similarity matching and semantic embeddings; and (3) 4D correlation modeling over support-query image pairs to improve cross-image matching fidelity. Integrated with a bottleneck adapter and a lightweight decoder, our approach achieves state-of-the-art performance on COCO-20i, PASCAL-5i, and FSS-1000, surpassing prior methods in accuracy while using significantly fewer parameters. This demonstrates the efficacy and strong generalization capability of few-shot segmentation driven by a single foundation model.
📝 Abstract
Few-shot semantic segmentation has gained increasing interest due to its generalization capability, i.e., segmenting pixels of novel classes from only a few annotated images. Prior work has focused on meta-learning for support-query matching, with extensive development of both prototype-based and aggregation-based methods. To address data scarcity, recent approaches have turned to foundation models to enhance representation transferability for novel class segmentation. Among them, a hybrid dual-modal framework that includes both DINOv2 and SAM has garnered attention because the two models have complementary capabilities. We ask: "Can we build a unified model with knowledge from both foundation models?" To this end, we propose FS-DINO, which uses only DINOv2's encoder and a lightweight segmenter. The segmenter comprises a bottleneck adapter, a meta-visual prompt generator based on dense similarities and semantic embeddings, and a decoder. Through coarse-to-fine cross-model distillation, we effectively integrate SAM's knowledge into our lightweight segmenter, which can be further enhanced by 4D correlation mining on support-query pairs. Extensive experiments on COCO-20i, PASCAL-5i, and FSS-1000 demonstrate the effectiveness and superiority of our method.
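To make the terms "dense similarities" and "4D correlation" concrete, the following is a minimal NumPy sketch of how such quantities are commonly computed in few-shot segmentation. It is not FS-DINO's implementation: the function names (`correlation_4d`, `dense_similarity_prior`), the use of cosine similarity, and the max-over-foreground reduction are all assumptions chosen for illustration, and random arrays stand in for DINOv2 patch features.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def correlation_4d(query_feat, support_feat):
    """Cosine similarity between every query and every support location.

    query_feat:   (Hq, Wq, C) patch features (hypothetical encoder output)
    support_feat: (Hs, Ws, C) patch features
    returns:      (Hq, Wq, Hs, Ws) correlation tensor
    """
    q = l2_normalize(query_feat)
    s = l2_normalize(support_feat)
    # Contract the channel axis; keep both spatial grids.
    return np.einsum("ijc,klc->ijkl", q, s)


def dense_similarity_prior(corr, support_mask):
    """Coarse foreground prior for the query image (illustrative choice).

    For each query location, take the maximum similarity to any support
    location that lies inside the annotated support mask.
    """
    fg = support_mask.astype(bool)
    if not fg.any():
        raise ValueError("support mask contains no foreground pixels")
    # Boolean indexing over the last two axes yields (Hq, Wq, N_fg).
    return corr[:, :, fg].max(axis=-1)


# Usage with random stand-in features.
rng = np.random.default_rng(0)
query = rng.normal(size=(6, 6, 16))
support = rng.normal(size=(6, 6, 16))
mask = np.zeros((6, 6))
mask[2:4, 2:4] = 1  # hypothetical support annotation

corr = correlation_4d(query, support)
prior = dense_similarity_prior(corr, mask)
print(corr.shape, prior.shape)
```

In a real pipeline the prior map would typically be one input among several to a prompt generator or decoder; how FS-DINO combines it with semantic embeddings is not specified in the abstract.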