Looking Beyond the Window: Global-Local Aligned CLIP for Training-free Open-Vocabulary Semantic Segmentation

📅 2026-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the issue of cross-window semantic inconsistency commonly encountered in training-free open-vocabulary semantic segmentation when using sliding-window inference. To mitigate this, the authors propose the GLA-CLIP framework, which enhances cross-window contextual interaction through a global-local alignment mechanism. A proxy anchor, constructed by aggregating tokens across all windows that are highly similar to a given query, serves as a unified semantic reference and effectively alleviates window-induced bias. Additionally, a dynamic attention normalization strategy adaptively modulates responses for objects at varying scales. The framework is plug-and-play and integrates seamlessly with existing CLIP-based segmentation models, yielding significant performance gains, particularly in small-object recognition and cross-window semantic consistency, without requiring additional training.

📝 Abstract
A sliding-window inference strategy is commonly adopted in recent training-free open-vocabulary semantic segmentation methods to overcome the limitation of CLIP in processing high-resolution images. However, this approach introduces a new challenge: each window is processed independently, leading to semantic discrepancies across windows. To address this issue, we propose Global-Local Aligned CLIP (GLA-CLIP), a framework that facilitates comprehensive information exchange across windows. Rather than limiting attention to tokens within individual windows, GLA-CLIP extends the key-value tokens to incorporate contextual cues from all windows. Nevertheless, we observe a window bias: outer-window tokens are less likely to be attended to, since query features are produced through interactions among the inner-window patches and therefore lack semantic grounding beyond their local context. To mitigate this, we introduce a proxy anchor, constructed by aggregating tokens from all windows that are highly similar to the given query, which provides a unified semantic reference for measuring similarity across both inner- and outer-window patches. Furthermore, we propose a dynamic normalization scheme that adjusts attention strength according to object scale by dynamically scaling and thresholding the attention map to cope with small-object scenarios. Moreover, GLA-CLIP can be plugged into existing methods to broaden their receptive field. Extensive experiments validate the effectiveness of GLA-CLIP in enhancing training-free open-vocabulary semantic segmentation performance. Code is available at https://github.com/2btlFe/GLA-CLIP.
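The abstract's core idea — letting each window's queries attend to key-value tokens from all windows, scored via a proxy anchor rather than the raw query — can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy, not the paper's implementation: the function name, the `top_k` aggregation, and the use of a simple mean for the anchor are all illustrative choices.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def global_local_attention(q_win, kv_all, top_k=4):
    """Toy sketch of cross-window attention with a proxy anchor.

    q_win:  (Nq, d) query tokens from the current window
    kv_all: (Nk, d) key-value tokens pooled from ALL windows

    The proxy anchor for each query is the mean of its top_k most
    similar tokens drawn from all windows (an illustrative choice);
    similarity to this anchor, rather than to the raw query, scores
    inner- and outer-window tokens on an equal footing.
    """
    sim = q_win @ kv_all.T                      # (Nq, Nk) raw similarities
    idx = np.argsort(-sim, axis=1)[:, :top_k]   # top-k token indices per query
    anchors = kv_all[idx].mean(axis=1)          # (Nq, d) proxy anchors
    attn = softmax(anchors @ kv_all.T, axis=1)  # anchor-based attention map
    return attn @ kv_all                        # (Nq, d) aggregated values
```

The paper additionally rescales and thresholds this attention map (the dynamic normalization scheme) before aggregation; that step is omitted here for brevity.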
Problem

Research questions and friction points this paper is trying to address.

open-vocabulary semantic segmentation
sliding-window inference
semantic discrepancy
CLIP
training-free
Innovation

Methods, ideas, or system contributions that make the work stand out.

training-free semantic segmentation
open-vocabulary segmentation
global-local alignment
proxy anchor
dynamic attention normalization