Improving Visual Discriminability of CLIP for Training-Free Open-Vocabulary Semantic Segmentation

📅 2025-10-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
CLIP’s image-level pretraining fundamentally mismatches the pixel-level understanding required for semantic segmentation, leaving its features insufficiently discriminative for open-vocabulary segmentation. To address this, we propose LHT-CLIP—a training-free framework that systematically analyzes and exploits CLIP’s visual discriminability within ViT architectures along three orthogonal dimensions: Layer, Head, and Token. The analysis reveals anomalous tokens with sparse, consistent activation patterns and a small subset of highly discriminative attention heads. Building on these insights, we introduce three lightweight, parameter-free strategies—semantic-spatial reweighting, selective head enhancement, and abnormal token replacement—that together restore effective pixel-level feature alignment. Crucially, LHT-CLIP requires no fine-tuning, no auxiliary pre-trained networks, no extensive hyperparameter tuning, and no additional learnable parameters. Evaluated on eight mainstream segmentation benchmarks, it achieves state-of-the-art performance across all of them, advancing open-vocabulary segmentation in both empirical effectiveness and practical deployability.

📝 Abstract
Extending CLIP models to semantic segmentation remains challenging due to the misalignment between their image-level pre-training objectives and the pixel-level visual understanding required for dense prediction. While prior efforts have achieved encouraging results by reorganizing the final layer and its features, they often inherit the global alignment bias of preceding layers, leading to suboptimal segmentation performance. In this work, we propose LHT-CLIP, a novel training-free framework that systematically exploits the visual discriminability of CLIP across the layer, head, and token levels. Through comprehensive analysis, we reveal three key insights: (i) the final layers primarily strengthen image-text alignment at the expense of visual discriminability (e.g., the last 3 layers in ViT-B/16 and the last 8 in ViT-L/14), partly due to the emergence of anomalous tokens; (ii) a subset of attention heads (e.g., 10 out of 144 in ViT-B/16) displays consistently strong visual discriminability across datasets; (iii) abnormal tokens display sparse and consistent activation patterns compared to normal tokens. Based on these findings, we propose three complementary techniques: semantic-spatial reweighting, selective head enhancement, and abnormal token replacement, which together restore visual discriminability and improve segmentation performance without any additional training, auxiliary pre-trained networks, or extensive hyperparameter tuning. Extensive experiments on 8 common semantic segmentation benchmarks demonstrate that LHT-CLIP achieves state-of-the-art performance across diverse scenarios, highlighting its effectiveness and practicality for real-world deployment.
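The abstract's third insight — abnormal tokens show sparse, consistent activations — suggests a simple detect-and-replace step. The sketch below illustrates that idea only; the sparsity statistic, the threshold, and the mean-replacement rule are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def replace_abnormal_tokens(tokens: np.ndarray, sparsity_thresh: float = 0.9) -> np.ndarray:
    """Illustrative sketch: flag ViT patch tokens with highly sparse
    activations as abnormal and replace them with the mean of the
    remaining (normal) tokens.

    tokens: (N, D) array of patch-token features.
    The detection statistic, threshold, and replacement rule are
    hypothetical stand-ins for the paper's method.
    """
    # Fraction of near-zero activations per token: a simple sparsity score.
    sparsity = (np.abs(tokens) < 1e-3).mean(axis=1)
    abnormal = sparsity > sparsity_thresh
    if abnormal.any() and not abnormal.all():
        tokens = tokens.copy()
        # Replace each abnormal token with the mean of the normal tokens.
        tokens[abnormal] = tokens[~abnormal].mean(axis=0)
    return tokens
```

In practice any imputation that draws on normal tokens (e.g. nearest spatial neighbors) would fit the same slot; the mean is used here only to keep the sketch short.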
Problem

Research questions and friction points this paper is trying to address.

Addresses CLIP's misalignment between image-level training and pixel-level segmentation needs
Enhances visual discriminability across layer, head, and token levels without retraining
Improves open-vocabulary semantic segmentation performance through three novel techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

Layer-wise semantic-spatial reweighting enhances visual discriminability
Selective head enhancement leverages discriminative attention heads
Abnormal token replacement restores visual discriminability degraded by anomalous tokens
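Selective head enhancement, as described above, amounts to upweighting the few attention heads found to be visually discriminative (e.g., 10 of 144 in ViT-B/16) when aggregating per-head attention. A minimal sketch, where the chosen head indices and the boost factor are assumptions rather than the paper's values:

```python
import numpy as np

def enhance_heads(head_attn: np.ndarray, good_heads: list, boost: float = 2.0) -> np.ndarray:
    """Illustrative sketch of selective head enhancement: combine per-head
    attention maps with larger weights on discriminative heads.

    head_attn: (H, N, N) per-head attention maps (rows sum to 1).
    good_heads / boost: hypothetical choices; the paper identifies such
    heads by analysis across datasets.
    Returns a single (N, N) map with rows renormalised to sum to 1.
    """
    weights = np.ones(head_attn.shape[0])
    weights[good_heads] = boost  # upweight discriminative heads
    combined = np.tensordot(weights, head_attn, axes=1)  # (N, N)
    return combined / combined.sum(axis=-1, keepdims=True)
```

The renormalisation keeps the result a valid attention distribution, so it can drop into the usual attention-weighted value aggregation unchanged.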