🤖 AI Summary
Addressing the few-shot semantic segmentation (FSS) challenge in the era of foundation models, this work introduces the first benchmark specifically designed for adapting large-scale vision models to FSS. We systematically evaluate four foundation models (DINO v2, SAM, CLIP, and MAE) together with a ResNet50 baseline pre-trained on COCO, across five adaptation strategies: linear probing, LoRA, feature distillation, prompt tuning, and full fine-tuning. Notably, this is the first comprehensive evaluation of both multimodal and unimodal vision foundation models in FSS. Our results reveal that DINO v2 substantially outperforms all others (achieving an average mIoU 8.2 percentage points higher than the second-best on Pascal-5i and COCO-20i), while linear probing alone attains 97.3% of full fine-tuning performance, drastically reducing computational overhead. These findings challenge prevailing assumptions about the necessity of complex adaptation mechanisms and establish a new empirical baseline and practical guidance for integrating foundation models with FSS.
📝 Abstract
In recent years, the rapid evolution of computer vision has seen the emergence of various foundation models, each tailored to specific data types and tasks. In this study, we explore the adaptation of these models for few-shot semantic segmentation. Specifically, we conduct a comprehensive comparative analysis of four prominent foundation models, DINO V2, Segment Anything, CLIP, and Masked AutoEncoders, together with a straightforward ResNet50 pre-trained on the COCO dataset. We also include five adaptation methods, ranging from linear probing to fine-tuning. Our findings show that DINO V2 outperforms the other models by a large margin, across various datasets and adaptation methods. On the other hand, the adaptation methods yield only small differences in performance, suggesting that simple linear probing can compete with more advanced, computationally intensive alternatives.
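The central finding that linear probing rivals heavier adaptation can be illustrated with a minimal sketch: freeze the backbone, extract per-pixel features once, and train only a linear per-pixel classifier on top. The snippet below is a toy NumPy version under stated assumptions; the feature array stands in for frozen DINO V2 features, and the label array stands in for ground-truth segmentation masks (neither is the paper's actual pipeline).

```python
import numpy as np

# Hypothetical stand-ins: D-dim features from a frozen backbone over an
# H x W feature map, and per-pixel class labels for C classes.
rng = np.random.default_rng(0)
D, H, W, C = 16, 8, 8, 3
feats = rng.normal(size=(H * W, D))       # frozen features (placeholder)
labels = rng.integers(0, C, size=H * W)   # segmentation labels (placeholder)

# Linear probe: the only trainable parameters are one weight matrix
# mapping each pixel's feature vector to class logits.
weights = np.zeros((D, C))
lr = 0.1
for _ in range(200):
    logits = feats @ weights
    # Softmax cross-entropy gradient w.r.t. the logits.
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    probs[np.arange(H * W), labels] -= 1.0
    weights -= lr * (feats.T @ probs) / (H * W)

# Per-pixel prediction: argmax over class logits.
pred = (feats @ weights).argmax(axis=1)
train_acc = float((pred == labels).mean())
```

Because the backbone is never updated, features can be cached and the probe trains in seconds, which is what makes the 97%-of-fine-tuning result practically significant.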