A Novel Benchmark for Few-Shot Semantic Segmentation in the Era of Foundation Models

📅 2024-01-20
🏛️ arXiv.org
📈 Citations: 7
Influential: 0
🤖 AI Summary
Addressing the few-shot semantic segmentation (FSS) challenge in the era of foundation models, this work introduces the first benchmark specifically designed for adapting large-scale vision models to FSS. It systematically evaluates five representative models (DINO v2, SAM, CLIP, MAE, and a COCO-pre-trained ResNet50) alongside five adaptation strategies: linear probing, LoRA, feature distillation, prompt tuning, and full fine-tuning. Notably, this is the first comprehensive evaluation of both multimodal and unimodal vision foundation models in FSS. The results show that DINO v2 substantially outperforms all other models (an average mIoU 8.2 percentage points higher than the second-best on Pascal-5i and COCO-20i), and that linear probing alone attains 97.3% of full fine-tuning performance at a fraction of the computational cost. These findings challenge prevailing assumptions about the necessity of complex adaptation mechanisms and establish a new empirical baseline, along with practical guidance, for integrating foundation models with FSS.

📝 Abstract
In recent years, the rapid evolution of computer vision has seen the emergence of various foundation models, each tailored to specific data types and tasks. In this study, we explore the adaptation of these models for few-shot semantic segmentation. Specifically, we conduct a comprehensive comparative analysis of four prominent foundation models (DINO V2, Segment Anything, CLIP, and Masked Autoencoders) as well as a straightforward ResNet50 pre-trained on the COCO dataset. We also include five adaptation methods, ranging from linear probing to fine-tuning. Our findings show that DINO V2 outperforms the other models by a large margin, across various datasets and adaptation methods. In contrast, the adaptation methods yield only small differences in results, suggesting that simple linear probing can compete with more advanced, computationally intensive alternatives.
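The linear-probing baseline highlighted in the abstract can be sketched as follows. This is a minimal illustration of the general technique, not the paper's actual code: the backbone is frozen and only a 1x1-convolution "linear" head over its feature map is trained. The stand-in backbone here is a hypothetical placeholder; in practice it would be a frozen DINO V2, SAM, or CLIP image encoder.

```python
import torch
import torch.nn as nn

class LinearProbeSegmenter(nn.Module):
    """Frozen feature extractor + trainable linear (1x1 conv) head."""

    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():  # freeze: only the head trains
            p.requires_grad = False
        self.head = nn.Conv2d(feat_dim, num_classes, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feats = self.backbone(x)          # (B, feat_dim, h, w) feature map
        logits = self.head(feats)             # per-location class logits
        # Upsample back to input resolution for dense prediction.
        return nn.functional.interpolate(
            logits, size=x.shape[-2:], mode="bilinear", align_corners=False
        )

# Stand-in backbone: a 16x16 patchifier, purely for illustration.
backbone = nn.Conv2d(3, 64, kernel_size=16, stride=16)
model = LinearProbeSegmenter(backbone, feat_dim=64, num_classes=21)

x = torch.randn(2, 3, 224, 224)
out = model(x)  # (2, 21, 224, 224)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
```

Only the head's parameters (here 64 x 21 weights + 21 biases = 1365) receive gradients, which is why this setup is so much cheaper than full fine-tuning.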
Problem

Research questions and friction points this paper is trying to address.

Exploring adaptation of vision foundation models for few-shot semantic segmentation
Proposing a novel benchmark for realistic evaluation of vision foundation models (VFMs) in few-shot semantic segmentation (FSS)
Comparing performance of segmentation and self-supervised models using various adaptation methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel benchmark for few-shot segmentation adaptation
Comprehensive analysis of vision foundation models
Parameter efficient fine-tuning for segmentation
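The parameter-efficient fine-tuning idea mentioned above, in the form of LoRA (one of the five adaptation strategies compared), can be sketched like this. This is a generic illustration of low-rank adaptation, not the paper's implementation: the pre-trained weight stays frozen and a trainable low-rank update `B @ A` is added on top. The rank and scaling values are arbitrary choices for the example.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer with an additive trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # pre-trained weights stay frozen
            p.requires_grad = False
        # A gets a small random init, B starts at zero, so at initialization
        # the layer behaves exactly like the pre-trained one.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768), rank=4)
x = torch.randn(2, 768)
y = layer(x)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
# rank * (in + out) = 4 * (768 + 768) = 6144 trainable parameters,
# versus 768 * 768 + 768 frozen ones in the base layer.
```

Because only `A` and `B` are updated, the number of trainable parameters scales with the rank rather than with the full weight matrix, which is the essence of parameter-efficient fine-tuning.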
Reda Bensaid
IMT Atlantique, Brest, France; Polytechnique Montréal, Montreal, Canada
Vincent Gripon
IMT Atlantique and Lab-STICC
Deep Learning · Few-Shot Learning · Artificial Intelligence
François Leduc-Primeau
Polytechnique Montréal, Montreal, Canada
Lukas Mauch
Sony Europe B.V.
machine learning · signal processing
G. B. Hacene
Sony Europe, B.V. Stuttgart Laboratory 1, Germany; Mila, Montreal, Canada
Fabien Cardinaux
Sony Europe, B.V. Stuttgart Laboratory 1, Germany