VIP: Visual-guided Prompt Evolution for Efficient Dense Vision-Language Inference

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the efficiency and generalization bottlenecks that CLIP’s spatial bias causes in training-free open-vocabulary semantic segmentation. To overcome these limitations, the authors propose Visual-guided Prompt evolution (VIP), a method built upon the dino.txt framework that enhances textual diversity through alias expansion and refines the semantic representations of text queries via visual-guided distillation and saliency-aware aggregation. Together, these components improve fine-grained vision–language alignment without requiring additional training. Experimental results show that VIP outperforms leading approaches by 1.4% to 8.4% average mIoU and generalizes across multiple challenging domains, while adding only marginal inference time and memory overhead.

📝 Abstract

Pursuing training-free open-vocabulary semantic segmentation in an efficient and generalizable manner remains challenging due to the deep-seated spatial bias in CLIP. To overcome the limitations of existing solutions, this work moves beyond the CLIP-based paradigm and harnesses the recent spatially-aware dino.txt framework to facilitate more efficient and high-quality dense prediction. While dino.txt exhibits robust spatial awareness, we find that the semantic ambiguity of text queries gives rise to severe mismatches within its dense cross-modal interactions. To address this, we introduce VIsual-guided Prompt evolution (VIP) to rectify the semantic expressiveness of text queries in dino.txt, unleashing its potential for fine-grained object perception. To this end, VIP integrates alias expansion with a visual-guided distillation mechanism to mine valuable semantic cues, which are robustly aggregated in a saliency-aware manner to yield a high-fidelity prediction. Extensive evaluations demonstrate that VIP (1) surpasses leading methods by 1.4% to 8.4% average mIoU, (2) generalizes well to diverse challenging domains, and (3) incurs only marginal inference time and memory overhead. Our code is publicly available on GitHub: https://github.com/MiSsU-HH/VIP.
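
To make the role of alias expansion in dense inference more concrete, here is a toy sketch. It is not the authors' implementation. It assumes that patch and text embeddings already live in a shared space, uses made-up 2-D vectors in place of real dino.txt features, and aggregates alias scores with a plain maximum as a stand-in for the saliency-aware aggregation described above. Visual-guided distillation is not modelled. The names `cosine` and `segment_patches` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def segment_patches(patch_embeddings, class_aliases):
    """Label each patch with the class whose best alias matches it most closely.

    patch_embeddings: list of patch feature vectors, one per image patch.
    class_aliases: dict mapping class name -> list of alias text embeddings.
    Assumption: alias scores are combined with max(); the paper's
    saliency-aware aggregation is NOT reproduced here.
    """
    labels = []
    for patch in patch_embeddings:
        best_class, best_score = None, -math.inf
        for name, alias_embeddings in class_aliases.items():
            score = max(cosine(patch, emb) for emb in alias_embeddings)
            if score > best_score:
                best_class, best_score = name, score
        labels.append(best_class)
    return labels


# Toy 2-D embeddings (illustrative values, not real features).
patches = [[0.9, 0.1], [0.1, 0.9], [0.6, 0.7]]

# One text query per class: the third patch is ambiguous.
single_query = {"sofa": [[1.0, 0.0]], "lamp": [[0.0, 1.0]]}

# "sofa" expanded with a second alias embedding (e.g. for "couch").
with_aliases = {"sofa": [[1.0, 0.0], [0.8, 0.6]], "lamp": [[0.0, 1.0]]}

labels_single = segment_patches(patches, single_query)
labels_alias = segment_patches(patches, with_aliases)
print(labels_single)
print(labels_alias)
```

In this toy setup the third patch switches from "lamp" to "sofa" once the extra alias is available, which illustrates how a more diverse set of text queries can change patch-level assignments. A real pipeline would operate on full feature maps with an array library rather than Python lists.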

Problem

Research questions and friction points this paper is trying to address.

open-vocabulary semantic segmentation

spatial bias

semantic ambiguity

dense vision-language inference

cross-modal interaction

Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual-guided Prompt Evolution

dense vision-language inference

spatially-aware modeling