Auto-Vocabulary Semantic Segmentation

📅 2023-12-07
📈 Citations: 2
Influential: 0
🤖 AI Summary
Existing open-vocabulary segmentation (OVS) methods typically rely on a user-specified category vocabulary, which limits their applicability in open-ended settings. To address this, we propose Auto-Vocabulary Semantic Segmentation (AVS), a setting in which the pipeline runs “image → category names → pixel-level segmentation” without a predefined vocabulary. Our key contributions are threefold: (1) a framework that autonomously identifies relevant class names, removing the dependence on pre-specified categories; (2) LAVE, an LLM-based Auto-Vocabulary Evaluator, which makes it possible to score open-ended class names that cannot be compared directly with a fixed ground truth; and (3) integration of enhanced BLIP embeddings with the subsequent segmentation step. Evaluated on PASCAL VOC, PASCAL Context, ADE20K, and Cityscapes, the method sets new benchmarks for AVS and is competitive with OVS methods that require specified class names.
📝 Abstract
Open-ended image understanding tasks have gained significant attention from the research community, particularly with the emergence of Vision-Language Models. Open-Vocabulary Segmentation (OVS) methods are capable of performing semantic segmentation without relying on a fixed vocabulary, and in some cases they operate without the need for training or fine-tuning. However, OVS methods typically require users to specify the vocabulary based on the task or dataset at hand. In this paper, we introduce Auto-Vocabulary Semantic Segmentation (AVS), advancing open-ended image understanding by eliminating the need to predefine object categories for segmentation. Our approach presents a framework that autonomously identifies relevant class names using enhanced BLIP embeddings, which are subsequently used for segmentation. Given that open-ended object category predictions cannot be directly compared with a fixed ground truth, we develop a Large Language Model-based Auto-Vocabulary Evaluator (LAVE) to efficiently evaluate the automatically generated class names and their corresponding segments. Our method sets new benchmarks for AVS on datasets such as PASCAL VOC, PASCAL Context, ADE20K, and Cityscapes, and shows performance competitive with OVS methods that require specified class names.
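To make the evaluation problem concrete, the following is a minimal sketch of how auto-generated class names might be scored against a fixed ground truth. It is not the paper's implementation: the abstract does not describe LAVE's internals, so the sketch assumes the LLM's role is to map each generated name onto one dataset category, after which a standard mean IoU is computed. The dictionary `ASSUMED_LLM_MAPPING` is a hypothetical stand-in for the LLM's answers, and the helper names are illustrative.

```python
import numpy as np

# Hypothetical stand-in for the LLM: each auto-generated class name is
# assigned to one category of the dataset vocabulary (an assumption).
ASSUMED_LLM_MAPPING = {"puppy": "dog", "kitten": "cat", "sofa": "background"}


def map_to_dataset_ids(pred_names, name_mask, mapping, dataset_classes):
    """Relabel a mask of indices into pred_names with dataset class indices."""
    lookup = np.array([dataset_classes.index(mapping[n]) for n in pred_names])
    return lookup[name_mask]


def mean_iou(pred, gt, num_classes):
    """Mean IoU over the classes that occur in the prediction or ground truth."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # class absent from both masks, skip it
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))


dataset_classes = ["background", "cat", "dog"]
pred_names = ["puppy", "sofa"]            # names generated for this image
name_mask = np.array([[0, 0], [1, 1]])    # per-pixel index into pred_names
gt = np.array([[2, 2], [0, 1]])           # per-pixel dataset class index

mapped = map_to_dataset_ids(pred_names, name_mask, ASSUMED_LLM_MAPPING, dataset_classes)
score = mean_iou(mapped, gt, len(dataset_classes))
```

In this toy case "puppy" is credited as "dog", while the missed "cat" pixel lowers the score, which illustrates why some mapping step is needed before open-ended names can be compared with dataset labels.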
Problem

Research questions and friction points this paper is trying to address.

OVS methods require users to predefine the object categories to segment.
Relevant class names must be identified autonomously, here using enhanced BLIP embeddings.
Auto-generated class names cannot be directly compared with a fixed ground truth, so an LLM-based evaluator is needed.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autonomous class identification via BLIP embeddings
Large Language Model-based Auto-Vocabulary Evaluator
Benchmarking on multiple datasets without predefined classes