AI Summary
To address SAM's lack of semantic awareness and its inability to support open-vocabulary, multi-granularity semantic segmentation, this paper proposes a dual-type composable prompting framework: Type-I prompts semantically align textual class labels with SAM's segmentation patches; Type-II prompts model instance consistency by judging, through a unified affinity computation between semantic/instance queries and SAM patches, whether two patches with the same label belong to the same instance. The method requires no fine-tuning of SAM, integrating zero-shot SAM segmentation, CLIP-based text-image matching, affinity graph construction, and hierarchical patch merging. It supports semantic, instance, and panoptic segmentation in both open- and closed-vocabulary settings. On open-vocabulary segmentation benchmarks, it achieves state-of-the-art performance, significantly outperforming existing adaptation methods across multiple datasets. Notably, it is the first framework to enable single-model, zero-shot, multi-granularity, open-vocabulary, semantic-aware segmentation.
Abstract
The Segment Anything model (SAM) has shown a generalized ability to group image pixels into patches, but applying it to semantic-aware segmentation still faces major challenges. This paper presents SAM-CP, a simple approach that establishes two types of composable prompts beyond SAM and composes them for versatile segmentation. Specifically, given a set of classes (in texts) and a set of SAM patches, the Type-I prompt judges whether a SAM patch aligns with a text label, and the Type-II prompt judges whether two SAM patches with the same text label also belong to the same instance. To decrease the complexity in dealing with a large number of semantic classes and patches, we establish a unified framework that calculates the affinity between (semantic and instance) queries and SAM patches and merges patches with high affinity to the query. Experiments show that SAM-CP achieves semantic, instance, and panoptic segmentation in both open and closed domains. In particular, it achieves state-of-the-art performance in open-vocabulary segmentation. Our research offers a novel and generalized methodology for equipping vision foundation models like SAM with multi-grained semantic perception abilities.
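The core mechanism described above, computing the affinity between (semantic and instance) queries and SAM patches, then merging patches with high affinity to each query, can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the function name `merge_patches_by_affinity`, the cosine-similarity affinity, and the threshold `tau` are assumptions for demonstration; the paper's actual affinity is learned within its unified framework.

```python
import numpy as np

def merge_patches_by_affinity(query_emb, patch_emb, patch_masks, tau=0.5):
    """Assign SAM patches to queries by affinity and merge their masks.

    query_emb:   (Q, D) embeddings of semantic/instance queries
    patch_emb:   (P, D) embeddings of SAM patches
    patch_masks: (P, H, W) boolean masks, one per SAM patch
    tau:         affinity threshold (an assumed hyperparameter)
    """
    # L2-normalize so the dot product is cosine affinity
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = patch_emb / np.linalg.norm(patch_emb, axis=1, keepdims=True)
    affinity = q @ p.T  # (Q, P) affinity matrix

    merged = []
    for qi in range(affinity.shape[0]):
        # Patches whose affinity to this query exceeds the threshold
        idx = np.where(affinity[qi] > tau)[0]
        if idx.size == 0:
            merged.append(np.zeros(patch_masks.shape[1:], dtype=bool))
            continue
        # Merge the selected patch masks into one segment for this query
        merged.append(np.any(patch_masks[idx], axis=0))
    return affinity, merged
```

A semantic query would yield one merged mask per class (semantic segmentation), while instance queries split same-class patches into separate objects; combining both gives panoptic output, matching the multi-granularity behavior the abstract describes.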