🤖 AI Summary
Open-vocabulary semantic segmentation suffers from high computational overhead and poor memory efficiency in two-stage approaches (e.g., SAM + CLIP). To address this, we propose ESC-Net, a single-stage model that directly repurposes the SAM decoder for class-agnostic segmentation, eliminating redundant feature reconstruction. We introduce an image-text correlation-driven pseudo-prompt embedding mechanism, integrated into SAM’s promptable framework to enable spatially aware fusion of vision-language priors. ESC-Net is trained end-to-end, so all components are refined jointly. On ADE20K, PASCAL-VOC, and PASCAL-Context, ESC-Net surpasses state-of-the-art methods at significantly lower computational cost. Ablation studies demonstrate its robustness under challenging conditions, including occlusion and small-object segmentation, validating the effectiveness of our design.
📝 Abstract
Open-vocabulary semantic segmentation aims to assign pixel-level labels to images across an unrestricted set of classes. Traditional methods address this by sequentially connecting a powerful mask proposal generator, such as the Segment Anything Model (SAM), with a pre-trained vision-language model like CLIP. However, these two-stage approaches often suffer from high computational costs and memory inefficiencies. In this paper, we propose ESC-Net, a novel one-stage open-vocabulary segmentation model that leverages the SAM decoder blocks for class-agnostic segmentation within an efficient inference framework. By embedding pseudo prompts generated from image-text correlations into SAM's promptable segmentation framework, ESC-Net achieves refined spatial aggregation for accurate mask predictions. ESC-Net achieves superior performance on standard benchmarks, including ADE20K, PASCAL-VOC, and PASCAL-Context, outperforming prior methods in both efficiency and accuracy. Comprehensive ablation studies further demonstrate its robustness across challenging conditions.
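The core idea of pseudo prompts from image-text correlations can be sketched as follows: score each image patch against a class text embedding, then treat the highest-scoring patch locations as point prompts for the promptable decoder. This is a minimal illustrative sketch, not the paper's actual implementation; the function name, feature shapes, and top-k selection strategy are all assumptions for illustration.

```python
import numpy as np

def pseudo_prompts_from_correlation(patch_feats, text_emb, grid_size, top_k=3):
    """Hypothetical sketch: pick the top-k patch locations whose features
    correlate most strongly with a class text embedding, and return them
    as (row, col) pseudo point prompts on the patch grid."""
    # Cosine similarity between every patch feature and the text embedding.
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    corr = p @ t                          # (num_patches,) correlation scores
    top_idx = np.argsort(corr)[::-1][:top_k]
    # Map flat patch indices back to 2D grid coordinates.
    coords = np.stack([top_idx // grid_size, top_idx % grid_size], axis=1)
    return coords, corr[top_idx]

# Toy usage: 16 patches on a 4x4 grid, 8-dim features.
rng = np.random.default_rng(0)
patch_feats = rng.standard_normal((16, 8))
text_emb = patch_feats[5].copy()          # plant a perfect match at patch 5
coords, scores = pseudo_prompts_from_correlation(patch_feats, text_emb, grid_size=4)
```

In the full model these coordinates would be encoded as prompt embeddings and fed to the SAM decoder in place of user clicks; here they are just grid indices.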