🤖 AI Summary
This work pioneers test-time adaptation (TTA) for open-vocabulary semantic segmentation (OVSS), addressing the previously unexplored challenge of unsupervised TTA in dense prediction with vision-language models (VLMs). We propose Multi-Level Multi-Prompt entropy minimization (MLMP), a plug-and-play, training-free, and label-free method that jointly optimizes CLIP's global text-image alignment and pixel-level visual representations, while integrating intermediate-layer features and diverse textual prompts. To enable systematic evaluation, we establish the first OVSS TTA benchmark, comprising seven datasets, fifteen image corruptions, and eighty-two distribution-shift scenarios. Under unified evaluation, single-sample TTA consistently improves mean Intersection-over-Union (mIoU) by 2.1–4.7 percentage points, substantially outperforming existing image-classification TTA methods. These results demonstrate MLMP's strong cross-distribution robustness and generalization capability in dense prediction settings.
📝 Abstract
Recently, test-time adaptation (TTA) has attracted wide interest in the context of vision-language models (VLMs) for image classification. However, to the best of our knowledge, the problem remains completely overlooked in dense prediction tasks such as Open-Vocabulary Semantic Segmentation (OVSS). In response, we propose a novel TTA method tailored to adapting VLMs for segmentation at test time. Unlike TTA methods for image classification, our Multi-Level and Multi-Prompt (MLMP) entropy minimization integrates features from intermediate vision-encoder layers and is performed with different text-prompt templates at both the global CLS-token and local pixel-wise levels. Our approach can be used as a plug-and-play module with any segmentation network, requires no additional training data or labels, and remains effective even with a single test sample. Furthermore, we introduce a comprehensive OVSS TTA benchmark suite, which integrates a rigorous evaluation protocol, seven segmentation datasets, and 15 common corruptions, for a total of 82 distinct test scenarios, establishing a standardized and comprehensive testbed for future TTA research in open-vocabulary segmentation. Our experiments on this suite demonstrate that our segmentation-tailored method consistently delivers significant gains over direct adoption of TTA classification baselines.
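To make the objective concrete, the core idea of multi-level, multi-prompt entropy minimization can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes per-pixel class logits are already available for each intermediate layer and each text-prompt template (the tensor shape and the function name `mlmp_entropy` are illustrative assumptions), and it shows only the scalar loss that a TTA step would minimize.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mlmp_entropy(logits):
    """Entropy of predictions fused across levels and prompts.

    logits: array of shape (L, P, N, C) -- L intermediate vision-encoder
    layers, P text-prompt templates, N pixels (or the global CLS token),
    C open-vocabulary classes.
    Returns a scalar: the mean Shannon entropy of the fused prediction,
    which a TTA step would minimize with respect to adaptable parameters.
    """
    probs = softmax(logits, axis=-1)            # per-level, per-prompt predictions
    fused = probs.mean(axis=(0, 1))             # average over levels and prompts -> (N, C)
    ent = -(fused * np.log(fused + 1e-12)).sum(axis=-1)  # per-pixel entropy
    return ent.mean()
```

As a sanity check, uniform logits give the maximum entropy log(C), while sharply peaked logits drive the loss toward zero; minimizing this quantity therefore pushes the fused multi-level, multi-prompt prediction toward confident, consistent pixel labels.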