VIP: Visual-guided Prompt Evolution for Efficient Dense Vision-Language Inference

📅 2026-05-12
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the efficiency and generalization bottlenecks that CLIP's spatial bias causes in training-free open-vocabulary semantic segmentation. The authors propose Visual-guided Prompt evolution (VIP), a method built on the dino.txt framework. VIP enhances textual diversity through alias expansion, then refines the semantic representations of text queries via visual-guided distillation and saliency-aware aggregation. Together these components improve fine-grained vision–language alignment without additional training. Experiments show that VIP outperforms state-of-the-art approaches across multiple challenging domains, with average mIoU gains of 1.4% to 8.4% and only marginal inference-time and memory overhead.
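For readers unfamiliar with the metric, the following is a minimal sketch of how mean intersection-over-union (mIoU) can be computed on a single flattened label map. The function name `mean_iou` and the toy labels are illustrative assumptions, not taken from the paper; segmentation benchmarks typically accumulate intersections and unions over a whole dataset before averaging.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that occur in pred or target.

    pred and target are flat sequences of integer class ids of equal length.
    This is a generic per-image sketch, not the paper's evaluation code.
    """
    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three classes and six "pixels" (hypothetical data).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, 3)
print(f"mIoU = {score:.3f}")
```

Here the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5.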
📝 Abstract
Pursuing training-free open-vocabulary semantic segmentation in an efficient and generalizable manner remains challenging due to the deep-seated spatial bias in CLIP. To overcome the limitations of existing solutions, this work moves beyond the CLIP-based paradigm and harnesses the recent spatially-aware dino.txt framework for more efficient, higher-quality dense prediction. While dino.txt exhibits robust spatial awareness, we find that the semantic ambiguity of text queries gives rise to severe mismatches in its dense cross-modal interactions. To address this, we introduce VIsual-guided Prompt evolution (VIP) to rectify the semantic expressiveness of text queries in dino.txt, unleashing its potential for fine-grained object perception. To this end, VIP integrates alias expansion with a visual-guided distillation mechanism to mine valuable semantic cues, which are robustly aggregated in a saliency-aware manner to yield a high-fidelity prediction. Extensive evaluations demonstrate that VIP: (1) surpasses leading methods by 1.4% to 8.4% average mIoU, (2) generalizes well to diverse challenging domains, and (3) requires marginal inference time and memory overhead. Our code is publicly available on GitHub: https://github.com/MiSsU-HH/VIP.
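To make the idea of dense cross-modal matching with alias expansion concrete, here is a minimal sketch under stated assumptions. Each image patch is labeled with the class whose best-matching alias embedding has the highest cosine similarity. The max-over-aliases rule, the function names, and the 2-D toy vectors are all hypothetical stand-ins; the sketch does not implement VIP's visual-guided distillation or saliency-aware aggregation, and it does not call dino.txt.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def segment(patches, alias_embeddings):
    """Assign each patch the class with the most similar alias embedding.

    patches: list of patch feature vectors.
    alias_embeddings: dict mapping class name -> list of alias text vectors.
    Taking the max over aliases is an illustrative assumption, not the
    aggregation described in the paper.
    """
    labels = []
    for patch in patches:
        best_class = max(
            alias_embeddings,
            key=lambda c: max(cosine(patch, a) for a in alias_embeddings[c]),
        )
        labels.append(best_class)
    return labels


# Toy 2-D "embeddings" (hypothetical); real features are high-dimensional.
patches = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.3]]
alias_embeddings = {
    "sofa": [[1.0, 0.0], [0.8, 0.2]],  # e.g. aliases "sofa", "couch"
    "plant": [[0.0, 1.0]],
}
labels = segment(patches, alias_embeddings)
print(labels)
```

With several aliases per class, a patch only needs to match one phrasing well, which is one simple way an ambiguous class name can be made more robust.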
Problem

Research questions and friction points this paper is trying to address.

open-vocabulary semantic segmentation
spatial bias
semantic ambiguity
dense vision-language inference
cross-modal interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual-guided Prompt Evolution
dense vision-language inference
spatially-aware modeling
semantic disambiguation
training-free segmentation
🔎 Similar Papers
2024-03-04 · Computer Vision and Pattern Recognition · Citations: 3
Hao Zhu
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing, China
Shuo Jin
XJTLU; University of Liverpool
Wenbin Liao
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing, China
Jiayu Xiao
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing, China
Yan Zhu
School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing, China
Siyue Yu
XJTLU
Feng Dai
Institute of Computing Technology, Chinese Academy of Sciences
video coding and processing; computational imaging