Split Matching for Inductive Zero-shot Semantic Segmentation

📅 2025-05-08
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Problem: Zero-shot semantic segmentation (ZSS) faces three key challenges: imprecise localization of unseen classes, overfitting to seen classes, and erroneous assignment of unseen-class pixels to background by standard Hungarian matching. Method: We propose a Split Matching framework that decouples Hungarian matching in the inductive ZSS setting, separately modeling pixel-to-class correspondences for seen and potential unseen classes. To generate high-quality pseudo-masks, we introduce CLIP-based dense feature clustering; additionally, we design a multi-scale residual feature aggregation module and a query grouping optimization module to enhance cross-class discriminability. Contribution/Results: This is the first work to decouple matching in inductive ZSS, effectively mitigating background misclassification. Our method achieves state-of-the-art performance on Pascal-5i and COCO-20i benchmarks, with substantial improvements in unseen-class mIoU.

πŸ“ Abstract
Zero-shot Semantic Segmentation (ZSS) aims to segment categories that are not annotated during training. While fine-tuning vision-language models has achieved promising results, these models often overfit to seen categories due to the lack of supervision for unseen classes. As an alternative to fully supervised approaches, query-based segmentation has shown great potential in ZSS, as it enables object localization without relying on explicit labels. However, conventional Hungarian matching, a core component of query-based frameworks, requires full supervision and often misclassifies unseen categories as background in the ZSS setting. To address this issue, we propose Split Matching (SM), a novel assignment strategy that decouples Hungarian matching into two components: one for seen classes in annotated regions and another for latent classes in unannotated regions (referred to as unseen candidates). Specifically, we partition the queries into seen and candidate groups, enabling each to be optimized independently according to its available supervision. To discover unseen candidates, we cluster CLIP dense features to generate pseudo masks and extract region-level embeddings using CLS tokens. Matching is then conducted separately for the two groups based on both class-level similarity and mask-level consistency. Additionally, we introduce a Multi-scale Feature Enhancement (MFE) module that refines decoder features through residual multi-scale aggregation, improving the model's ability to capture spatial details across resolutions. SM is the first to introduce decoupled Hungarian matching under the inductive ZSS setting, and achieves state-of-the-art performance on two standard benchmarks.
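
The decoupled assignment described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. The function names (`pair_cost`, `min_cost_assignment`, `split_match`), the cost weights, and the toy cost values are all hypothetical, and a brute-force search stands in for a real Hungarian solver so that the example needs only the standard library. It assumes the seen-group queries come first in the query list and that each group has at least as many queries as targets.

```python
from itertools import permutations


def pair_cost(class_sim, mask_iou, w_cls=1.0, w_mask=1.0):
    """Hypothetical matching cost: low when class similarity and mask overlap are high.

    The weights w_cls and w_mask are illustrative assumptions, not values from the paper.
    """
    return w_cls * (1.0 - class_sim) + w_mask * (1.0 - mask_iou)


def min_cost_assignment(cost):
    """Optimal one-to-one assignment by brute force (stand-in for Hungarian matching).

    cost[q][t] is the cost of assigning query q to target t.
    Returns a sorted list of (query_index, target_index) pairs.
    Only practical for tiny inputs; a real system would use a Hungarian solver.
    """
    n_queries = len(cost)
    n_targets = len(cost[0]) if cost else 0
    if n_targets == 0:
        return []
    if n_targets > n_queries:
        raise ValueError("need at least as many queries as targets")
    best, best_total = None, float("inf")
    # Try every way of picking one distinct query per target.
    for queries in permutations(range(n_queries), n_targets):
        total = sum(cost[q][t] for t, q in enumerate(queries))
        if total < best_total:
            best, best_total = queries, total
    return sorted((q, t) for t, q in enumerate(best))


def split_match(seen_cost, candidate_cost):
    """Match the seen group and the candidate group independently.

    seen_cost: rows are seen-group queries, columns are annotated seen-class masks.
    candidate_cost: rows are candidate-group queries, columns are pseudo masks.
    Candidate query indices are offset so that they refer to the full query list.
    """
    seen_pairs = min_cost_assignment(seen_cost)
    offset = len(seen_cost)
    candidate_pairs = [(q + offset, t) for q, t in min_cost_assignment(candidate_cost)]
    return seen_pairs, candidate_pairs


# Toy example: 3 seen queries vs. 2 annotated masks, 2 candidate queries vs. 1 pseudo mask.
seen_cost = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
candidate_cost = [[0.3], [0.05]]
print(split_match(seen_cost, candidate_cost))  # ([(0, 1), (1, 0)], [(4, 0)])
```

Because the two groups are matched on separate cost matrices, a candidate query is never forced to compete with seen-class targets. Queries left unmatched in either group (query 2 and query 3 here) would be handled by whatever no-object rule the surrounding framework uses.
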
Problem

Research questions and friction points this paper is trying to address.

Overfitting to seen categories in zero-shot semantic segmentation
Misclassification of unseen categories as background in query-based frameworks
Lack of supervision for unseen classes in Hungarian matching
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes Split Matching for decoupling Hungarian matching
Uses CLIP features for pseudo masks and embeddings
Introduces Multi-scale Feature Enhancement module
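
The pseudo-mask step listed above (clustering dense features into candidate regions) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure. Plain k-means with squared Euclidean distance is an assumed choice of clustering algorithm, the function names are hypothetical, centers are initialized deterministically from evenly spaced pixels, and the two-dimensional toy vectors stand in for real CLIP dense features. Extracting region-level embeddings with CLS tokens requires a CLIP model and is not shown.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster_pseudo_masks(features, k, iters=10):
    """Cluster per-pixel features with k-means and return k binary masks.

    features: list of feature vectors, one per pixel (image flattened to H*W).
    Returns a list of k masks, each a list of 0/1 flags over the same pixels.
    Initialization from evenly spaced pixels is an assumption made for determinism.
    """
    n = len(features)
    centers = [list(features[i * n // k]) for i in range(k)]
    labels = [0] * n
    for _ in range(iters):
        # Assignment step: each pixel joins its nearest center.
        labels = [
            min(range(k), key=lambda c: squared_distance(f, centers[c]))
            for f in features
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [f for f, lab in zip(features, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return [[1 if lab == c else 0 for lab in labels] for c in range(k)]


# Toy example: 6 "pixels" with 2-D features forming two obvious groups.
toy_features = [[1, 0], [0.9, 0.1], [1, 0.1], [0, 1], [0.1, 0.9], [0, 0.9]]
print(cluster_pseudo_masks(toy_features, k=2))
# [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]]
```

Each returned mask marks one cluster of pixels and could serve as a pseudo mask for a candidate region.
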