🤖 AI Summary
Addressing the few-shot semantic segmentation (FSS) challenge in the era of foundation models, this work introduces the first benchmark specifically designed for adapting large-scale vision models to FSS. We systematically evaluate four foundation models (DINO v2, SAM, CLIP, and MAE) together with a ResNet50 baseline pre-trained on COCO, across five adaptation strategies: linear probing, LoRA, feature distillation, prompt tuning, and full fine-tuning. Notably, this is the first comprehensive evaluation of both multimodal and unimodal vision foundation models in FSS. Our results reveal that DINO v2 substantially outperforms all others (achieving an average mIoU 8.2 percentage points higher than the second-best on Pascal-5i and COCO-20i), while linear probing alone attains 97.3% of full fine-tuning performance, drastically reducing computational overhead. These findings challenge prevailing assumptions about the necessity of complex adaptation mechanisms and establish a new empirical baseline and practical guidance for integrating foundation models with FSS.
📝 Abstract
In recent years, the rapid evolution of computer vision has seen the emergence of various foundation models, each tailored to specific data types and tasks. In this study, we explore the adaptation of these models for few-shot semantic segmentation. Specifically, we conduct a comprehensive comparative analysis of four prominent foundation models, DINO V2, Segment Anything, CLIP, and Masked AutoEncoders, together with a straightforward ResNet50 pre-trained on the COCO dataset. We also include five adaptation methods, ranging from linear probing to fine-tuning. Our findings show that DINO V2 outperforms the other models by a large margin, across various datasets and adaptation methods. On the other hand, the adaptation methods yield only small differences in performance, suggesting that simple linear probing can compete with more advanced, computationally intensive alternatives.
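The central finding that linear probing rivals heavier adaptation can be illustrated with a minimal sketch: freeze the backbone, extract per-pixel features once, and train only a linear per-pixel classifier on top. The snippet below is a toy NumPy version under stated assumptions; the feature array stands in for frozen DINO V2 features, and the label array stands in for ground-truth segmentation masks (neither is the paper's actual pipeline).

```python
import numpy as np

# Hypothetical stand-ins: D-dim features from a frozen backbone over an
# H x W feature map, and per-pixel class labels for C classes.
rng = np.random.default_rng(0)
D, H, W, C = 16, 8, 8, 3
feats = rng.normal(size=(H * W, D))       # frozen features (placeholder)
labels = rng.integers(0, C, size=H * W)   # segmentation labels (placeholder)

# Linear probe: the only trainable parameters are one weight matrix
# mapping each pixel's feature vector to class logits.
weights = np.zeros((D, C))
lr = 0.1
for _ in range(200):
    logits = feats @ weights
    # Softmax cross-entropy gradient w.r.t. the logits.
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    probs[np.arange(H * W), labels] -= 1.0
    weights -= lr * (feats.T @ probs) / (H * W)

# Per-pixel prediction: argmax over class logits.
pred = (feats @ weights).argmax(axis=1)
train_acc = float((pred == labels).mean())
```

Because the backbone is never updated, features can be cached and the probe trains in seconds, which is what makes the 97%-of-fine-tuning result practically significant.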