🤖 AI Summary
CLIP exhibits weak localization capability in open-vocabulary segmentation, primarily due to insufficient direct interaction between intermediate attention mechanisms and text embeddings, and poor propagation of spatial consistency to the final output.
Method: We propose a training-free, semantics-guided adaptive attention mechanism that leverages output predictions as strong priors to retroactively steer intermediate attention layers, establishing a closed-loop optimization for semantic–spatial consistency. Our approach comprises attention isolation, confidence-based pruning for sparse adaptation, and adaptation ensemble, and is compatible with Q-K attention, self-attention, and proxy-augmented attention, requiring no architectural modification to ViT backbones.
Contribution/Results: The method consistently improves four state-of-the-art methods across eight benchmarks, achieving measurable gains in generalization and robustness across diverse segmentation scenarios—without additional training or fine-tuning.
📝 Abstract
CLIP exhibits strong visual-textual alignment but struggles with open-vocabulary segmentation due to poor localization. Prior methods enhance spatial coherence by modifying intermediate attention, but this coherence is not consistently propagated to the final output because of subsequent operations such as projections. Additionally, intermediate attention lacks direct interaction with text representations; this semantic discrepancy limits the full potential of CLIP.
In this work, we propose a training-free, feedback-driven self-adaptive framework that adapts output-based patch-level correspondences back to the intermediate attention. The output predictions, being the culmination of the model's processing, encapsulate the most comprehensive visual and textual semantics about each patch. Our approach enhances semantic consistency between internal representations and final predictions by leveraging the model's outputs as a stronger spatial coherence prior. We design key modules, including attention isolation, confidence-based pruning for sparse adaptation, and adaptation ensemble, to effectively feed back the output coherence cues. Our method functions as a plug-in module, seamlessly integrating into four state-of-the-art approaches with three backbones (ViT-B, ViT-L, ViT-H). We further validate our framework across multiple attention types (Q-K, self-self, and Proxy attention augmented with MAE, SAM, and DINO). Our approach consistently improves their performance across eight benchmarks.
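The core feedback idea can be illustrated with a minimal sketch. This is not the authors' implementation; it is a hypothetical NumPy illustration in which the per-patch output logits define a semantic affinity prior between patches, low-confidence patches are pruned from the feedback, and the prior is blended back into an intermediate attention map. All function and parameter names (`feedback_adapted_attention`, `conf_thresh`, `alpha`) are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def feedback_adapted_attention(attn, patch_logits, conf_thresh=0.5, alpha=0.5):
    """Hypothetical sketch of output-to-attention feedback.

    attn         : (N, N) intermediate attention map over N patches (rows sum to 1).
    patch_logits : (N, C) final per-patch class logits (the model's output).
    Returns an adapted (N, N) attention map.
    """
    probs = softmax(patch_logits, axis=-1)          # per-patch class distribution
    # Output-based patch-level correspondence: patches whose predicted
    # semantics agree should attend to each other.
    prior = probs @ probs.T                          # (N, N) semantic affinity
    # Confidence-based pruning: only confidently classified patches
    # contribute to the sparse feedback signal.
    conf = probs.max(axis=-1)                        # (N,)
    keep = (conf[:, None] >= conf_thresh) & (conf[None, :] >= conf_thresh)
    prior = np.where(keep, prior, 0.0)
    # Normalize surviving rows, then blend with the original attention.
    # (An adaptation ensemble would average several such adapted maps.)
    prior = prior / (prior.sum(axis=-1, keepdims=True) + 1e-8)
    return (1.0 - alpha) * attn + alpha * prior
```

Because the adaptation is a convex blend of two non-negative maps, it can steer intermediate attention toward the output's spatial coherence without any training or architectural change, which is what makes the approach usable as a plug-in.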