🤖 AI Summary
In open-world semantic segmentation, conventional methods rely on predefined category sets and struggle to separate a queried foreground concept (e.g., "dog") from background regions when given only a single textual query. To address this, the paper proposes generating contrastive text at test time: two complementary approaches, one modeling the text distribution of the vision-language model's (VLM's) training data and one using crafted large language model (LLM) prompts, automatically produce query-specific contrastive concepts. The paper also introduces an evaluation metric tailored to this single-query setting. The approach builds on CLIP-style vision-language representations, analysis of the text embedding space, and contrastive assignment at inference. Experiments on Pascal Context and COCO-Stuff show clear accuracy improvements, indicating good generalization and robustness in the open-vocabulary setting.
📝 Abstract
Recent CLIP-like Vision-Language Models (VLMs), pre-trained on large amounts of image-text pairs to align both modalities with a simple contrastive objective, have paved the way to open-vocabulary semantic segmentation. Given an arbitrary set of textual queries, image pixels are assigned to the closest query in feature space. However, this works well only when a user exhaustively lists all possible visual concepts in an image, so that the queries contrast against each other for the assignment. This corresponds to the current evaluation setup in the literature, which relies on access to a list of in-domain relevant concepts, typically the classes of a benchmark dataset. Here, we consider the more challenging (and realistic) scenario of segmenting a single concept, given a textual prompt and nothing else. To achieve good results, besides contrasting with the generic $\textit{background}$ text, we propose two different approaches to automatically generate, at test time, textual contrastive concepts that are query-specific. We do so by leveraging the distribution of text in the VLM's training set or crafted LLM prompts. We also propose a metric designed to evaluate this scenario and show the relevance of our approach on commonly used datasets.
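The core assignment step described in the abstract can be sketched as a nearest-text lookup: each pixel feature is compared (by cosine similarity) against the embedding of the single query plus the embeddings of contrastive texts such as "background" and any auto-generated contrastive concepts. The sketch below is a minimal illustration with NumPy and toy 2-D vectors; the function name `segment_single_query` and the toy embeddings are hypothetical stand-ins for real CLIP pixel and text features, not the paper's implementation.

```python
import numpy as np

def segment_single_query(pixel_feats, text_embeds):
    """Assign each pixel feature to the closest text embedding (cosine similarity).

    Index 0 of text_embeds is the user's query; the remaining rows are
    contrastive texts (e.g. "background", generated contrastive concepts).
    """
    p = pixel_feats / np.linalg.norm(pixel_feats, axis=-1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
    sims = p @ t.T                 # (num_pixels, num_texts) cosine similarities
    return sims.argmax(axis=-1)    # 0 = query; >0 = some contrastive concept

# Toy 2-D "embeddings" for illustration only (real CLIP features are ~512-D).
query = np.array([1.0, 0.0])                  # e.g. "dog"
contrasts = np.array([[0.0, 1.0],             # generic "background" text
                      [-1.0, 0.0]])           # a generated contrastive concept
texts = np.vstack([query, contrasts])
pixels = np.array([[0.9, 0.1],                # pixel feature close to the query
                   [0.1, 0.9]])               # pixel feature close to background
labels = segment_single_query(pixels, texts)  # → array([0, 1])
```

Without the contrastive rows, every pixel would trivially be assigned to the lone query; adding "background" and query-specific contrastive concepts is what makes single-query segmentation meaningful.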