Partial CLIP is Enough: Chimera-Seg for Zero-shot Semantic Segmentation

📅 2025-06-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Zero-shot semantic segmentation (ZSS) aims to segment both seen and unseen classes using only seen-class supervision, yet existing CLIP-based distillation methods suffer from two key limitations: (1) the difficulty of aligning spatially precise visual features with the textual embedding space, and (2) a semantic gap between CLIP's global image representations and the fine-grained, local features of segmentation models. To address these, we propose Chimera-Seg, which pairs a trainable segmentation backbone with a CLIP Semantic Head built from partially frozen CLIP modules, retaining segmentation capability while easing the mapping into CLIP's semantic space. A Selective Global Distillation scheme transfers knowledge only from dense features most similar to the CLIP CLS token, gradually shrinking this set as training progresses, and a Semantic Alignment Module further aligns dense visual features with embeddings from the frozen CLIP text encoder. Evaluated on Pascal-5i and COCO-20i benchmarks, Chimera-Seg improves harmonic mean IoU (hIoU) by 0.9% and 1.2%, respectively.

📝 Abstract
Zero-shot Semantic Segmentation (ZSS) aims to segment both seen and unseen classes using supervision from only seen classes. Beyond adaptation-based methods, distillation-based approaches transfer the vision-language alignment of vision-language models, e.g., CLIP, to segmentation models. However, such knowledge transfer remains challenging due to: (1) the difficulty of aligning vision-based features with the textual space, which requires combining spatial precision with vision-language alignment; and (2) the semantic gap between CLIP's global representations and the local, fine-grained features of segmentation models. To address challenge (1), we propose Chimera-Seg, which integrates a segmentation backbone as the body and a CLIP-based semantic head as the head, like the Chimera in Greek mythology, combining spatial precision with vision-language alignment. Specifically, Chimera-Seg comprises a trainable segmentation model and a CLIP Semantic Head (CSH), which maps dense features into the CLIP-aligned space. The CSH incorporates a frozen subnetwork and fixed projection layers from the CLIP visual encoder, along with lightweight trainable components. This partial module from the CLIP visual encoder, paired with the segmentation model, retains segmentation capability while easing the mapping into CLIP's semantic space. To address challenge (2), we propose Selective Global Distillation (SGD), which distills knowledge from dense features exhibiting high similarity to the CLIP CLS token, while gradually reducing the number of features used for alignment as training progresses. In addition, we use a Semantic Alignment Module (SAM) to further align dense visual features with semantic embeddings extracted from the frozen CLIP text encoder. Experiments on two benchmarks show improvements of 0.9% and 1.2% in hIoU.
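The Selective Global Distillation idea described in the abstract can be sketched as follows. This is a minimal illustration under assumed shapes and a hypothetical linear top-k schedule, not the paper's implementation; the names `topk_schedule` and `sgd_distill_loss` and the choice of a cosine-distance distillation objective are assumptions made here for illustration.

```python
import numpy as np

def topk_schedule(step, total_steps, n_feats, k_min=8):
    # Hypothetical linear decay: begin by aligning all dense features,
    # then gradually shrink the selected set to k_min as training progresses.
    frac = 1.0 - step / max(total_steps, 1)
    return max(k_min, int(round(n_feats * frac)))

def sgd_distill_loss(dense_feats, cls_token, step, total_steps):
    """Selective Global Distillation sketch.

    dense_feats: (N, D) dense features from the segmentation model.
    cls_token:   (D,) global CLS embedding from the frozen CLIP visual encoder.
    Only the features most similar to the CLS token contribute to the loss.
    """
    # Cosine similarity between each dense feature and the CLIP CLS token.
    f = dense_feats / np.linalg.norm(dense_feats, axis=1, keepdims=True)
    g = cls_token / np.linalg.norm(cls_token)
    sims = f @ g                                  # (N,)

    # Keep the top-k most CLS-similar features; k shrinks over training.
    k = topk_schedule(step, total_steps, len(sims))
    top_idx = np.argsort(sims)[-k:]

    # Distillation objective: pull the selected features toward the CLS token.
    return float(np.mean(1.0 - sims[top_idx]))
```

Early in training nearly all dense features are aligned with the global representation; later only the most CLS-similar ones are, which avoids forcing features that are dissimilar to the global token (e.g. background regions) toward it.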
Problem

Research questions and friction points this paper is trying to address.

Align vision-based features with textual space for segmentation
Bridge semantic gap between CLIP and segmentation models
Improve zero-shot semantic segmentation accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chimera-Seg integrates segmentation backbone with CLIP head
Selective Global Distillation enhances feature alignment
Semantic Alignment Module aligns visual and text features
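The vision-language alignment in the bullets above ultimately serves a CLIP-style per-pixel classifier: once dense features are mapped into the CLIP-aligned space, each pixel can be labeled with the class whose text embedding it is most similar to, covering both seen and unseen classes. The sketch below illustrates that standard mechanism; all names, shapes, and the temperature value are illustrative assumptions, not the paper's API.

```python
import numpy as np

def zero_shot_segment(pixel_feats, text_embeds, temperature=0.01):
    """Assign each pixel the class whose text embedding is most similar.

    pixel_feats: (H, W, D) dense features mapped into the CLIP-aligned space.
    text_embeds: (C, D) embeddings of class prompts from the CLIP text encoder.
    Returns an (H, W) map of class indices.
    """
    H, W, D = pixel_feats.shape
    f = pixel_feats.reshape(-1, D)
    f = f / np.linalg.norm(f, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    logits = (f @ t.T) / temperature          # (H*W, C) scaled cosine logits
    return logits.argmax(axis=1).reshape(H, W)
```

Because the classifier is just a similarity against text embeddings, adding an unseen class at test time only requires encoding its prompt; no segmentation weights change.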