🤖 AI Summary
SAM2, optimized for visual tracking, suffers from entangled semantic features that hinder generalization to unseen classes in few-shot segmentation. Method: We propose SANSA, the first framework to uncover and exploit rich high-level semantic structures latent in SAM2’s vision features—without modifying its weights—via a lightweight semantic alignment module and feature decoupling strategy, enabling class-aware prompt-and-propagate segmentation with diverse input prompts (points, boxes, scribbles). Results: SANSA achieves state-of-the-art performance on generalization-oriented few-shot segmentation benchmarks; significantly outperforms general-purpose segmentation models under in-context learning; and enables efficient inference with negligible parameter overhead. Its core contribution is the zero-shot, weight-free decoupling and explicit utilization of SAM2’s pre-trained semantic knowledge, effectively bridging the gap between tracking-oriented foundation models and semantic segmentation tasks.
📝 Abstract
Few-shot segmentation aims to segment unseen object categories from just a handful of annotated examples. This requires mechanisms that can both identify semantically related objects across images and accurately produce segmentation masks. We note that Segment Anything 2 (SAM2), with its prompt-and-propagate mechanism, offers both strong segmentation capabilities and a built-in feature matching process. However, we show that its representations are entangled with task-specific cues optimized for object tracking, which impairs its use for tasks requiring higher level semantic understanding. Our key insight is that, despite its class-agnostic pretraining, SAM2 already encodes rich semantic structure in its features. We propose SANSA (Semantically AligNed Segment Anything 2), a framework that makes this latent structure explicit, and repurposes SAM2 for few-shot segmentation through minimal task-specific modifications. SANSA achieves state-of-the-art performance on few-shot segmentation benchmarks specifically designed to assess generalization, outperforms generalist methods in the popular in-context setting, supports various prompts flexible interaction via points, boxes, or scribbles, and remains significantly faster and more compact than prior approaches. Code is available at https://github.com/ClaudiaCuttano/SANSA.