DiSa: Saliency-Aware Foreground-Background Disentangled Framework for Open-Vocabulary Semantic Segmentation

📅 2026-01-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing open-vocabulary semantic segmentation methods, which often rely on vision-language models and suffer from foreground bias and ambiguous spatial localization, leading to inaccurate delineation of object boundaries and background regions. To overcome these challenges, the authors propose DiSa, a framework that explicitly incorporates saliency cues and adopts a divide-and-conquer strategy to decouple foreground and background modeling. DiSa integrates multi-level spatial context for feature refinement through two key components: a Saliency-aware Disentanglement Module (SDM) for foreground-background disentanglement and a Hierarchical Refinement Module (HRM) for pixel-wise, channel-wise feature refinement. Built upon CLIP, DiSa consistently outperforms state-of-the-art methods across six benchmark datasets, improving both segmentation accuracy and boundary sharpness.

📝 Abstract
Open-vocabulary semantic segmentation aims to assign a label to every pixel in an image based on arbitrary text labels. Existing approaches typically utilize vision-language models (VLMs), such as CLIP, for dense prediction. However, VLMs, pre-trained on image-text pairs, are biased toward salient, object-centric regions and exhibit two critical limitations when adapted to segmentation: (i) Foreground Bias, a tendency to ignore background regions, and (ii) Limited Spatial Localization, resulting in blurred object boundaries. To address these limitations, we introduce DiSa, a novel saliency-aware foreground-background disentangled framework. By explicitly incorporating saliency cues in our designed Saliency-aware Disentanglement Module (SDM), DiSa separately models foreground and background ensemble features in a divide-and-conquer manner. Additionally, we propose a Hierarchical Refinement Module (HRM) that leverages pixel-wise spatial contexts and enables channel-wise feature refinement through multi-level updates. Extensive experiments on six benchmarks demonstrate that DiSa consistently outperforms state-of-the-art methods.
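As a rough illustration of the divide-and-conquer idea, the sketch below shows how a saliency map could split dense VLM features into separately pooled foreground and background representations. This is a minimal toy version, not the paper's actual SDM; the function name, shapes, and weighting scheme are assumptions for illustration only.

```python
import numpy as np

def saliency_decouple(features, saliency, eps=1e-6):
    """Toy saliency-based foreground/background decoupling.

    features: (H, W, C) dense features from a VLM image encoder.
    saliency: (H, W) saliency map in [0, 1]; high values mark
              salient (foreground) regions.

    Returns saliency-weighted pooled foreground and background
    feature vectors, so each can be modeled separately.
    """
    fg_w = saliency[..., None]          # (H, W, 1) foreground weights
    bg_w = 1.0 - fg_w                   # complementary background weights
    fg = (features * fg_w).sum((0, 1)) / (fg_w.sum() + eps)
    bg = (features * bg_w).sum((0, 1)) / (bg_w.sum() + eps)
    return fg, bg
```

In a full pipeline, each pooled vector would then be matched against text embeddings independently, so background classes are no longer drowned out by salient objects.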
Problem

Research questions and friction points this paper is trying to address.

open-vocabulary semantic segmentation
foreground bias
limited spatial localization
vision-language models
saliency
Innovation

Methods, ideas, or system contributions that make the work stand out.

saliency-aware disentanglement
foreground-background separation
open-vocabulary semantic segmentation
hierarchical refinement
vision-language models