A Simple Framework for Open-Vocabulary Zero-Shot Segmentation

📅 2024-06-23

🏛️ International Conference on Learning Representations

📈 Citations: 3

✨ Influential: 0

🤖 AI Summary

Problem: Vision-language contrastive models exhibit zero-shot classification capability but localize poorly in dense prediction tasks such as open-vocabulary zero-shot image segmentation. This is often attributed to the entanglement of image representation learning with cross-modal alignment and to the absence of localization cues in captions. Method: SimZSS decouples the two. It keeps a spatially aware, vision-only encoder (e.g., a ViT) frozen and trains only the text encoder for cross-modal alignment, while exploiting the discrete nature of text and linguistic knowledge to pinpoint local concepts within captions. The method requires only image-caption pairs and adapts to both small curated and large-scale noisy datasets. Contribution/Results: Trained on COCO Captions across 8 GPUs, SimZSS achieves state-of-the-art results on 7 out of 8 benchmark datasets in less than 15 minutes.

📝 Abstract

Zero-shot classification capabilities naturally arise in models trained within a vision-language contrastive framework. Despite their classification prowess, these models struggle in dense tasks like zero-shot open-vocabulary segmentation. This deficiency is often attributed to the absence of localization cues in captions and the intertwined nature of the learning process, which encompasses both image representation learning and cross-modality alignment. To tackle these issues, we propose SimZSS, a Simple framework for open-vocabulary Zero-Shot Segmentation. The method is founded on two key principles: i) leveraging frozen vision-only models that exhibit spatial awareness while exclusively aligning the text encoder and ii) exploiting the discrete nature of text and linguistic knowledge to pinpoint local concepts within captions. By capitalizing on the quality of the visual representations, our method requires only image-caption pairs datasets and adapts to both small curated and large-scale noisy datasets. When trained on COCO Captions across 8 GPUs, SimZSS achieves state-of-the-art results on 7 out of 8 benchmark datasets in less than 15 minutes.
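
The second principle, pinpointing local concepts within captions, can be illustrated with a small sketch. This is not the paper's pipeline: the abstract only says that linguistic knowledge is used, so plain phrase matching against a hypothetical concept bank stands in for it here. The names `CONCEPT_BANK` and `find_concepts` are illustrative assumptions.

```python
import re

# Hypothetical concept bank; the source does not list the concepts it uses.
CONCEPT_BANK = {"dog", "frisbee", "grass", "traffic light"}


def find_concepts(caption, bank=CONCEPT_BANK, max_words=2):
    """Return (concept, start_word, end_word) spans found in the caption.

    Simple exact phrase matching, used here as a stand-in for the
    linguistic analysis mentioned in the abstract (an assumption).
    """
    words = re.findall(r"[a-z]+", caption.lower())
    spans = []
    i = 0
    while i < len(words):
        matched = False
        # Prefer the longest phrase that starts at word i.
        for n in range(max_words, 0, -1):
            window = words[i:i + n]
            if len(window) == n and " ".join(window) in bank:
                spans.append((" ".join(window), i, i + n))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return spans


spans = find_concepts("A dog catches a frisbee near a traffic light")
```

Because text is discrete, each matched concept maps to an exact word span, which is what makes it possible to tie a concept to specific tokens of the caption. Exact matching misses plurals and synonyms, so a real system would likely need something more robust.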

Problem

Research questions and friction points this paper is trying to address.

Vision-language contrastive models struggle in dense tasks such as zero-shot open-vocabulary segmentation

Captions lack localization cues

Image representation learning and cross-modality alignment are intertwined during training

Innovation

Methods, ideas, or system contributions that make the work stand out.

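Leverages frozen vision-only models that exhibit spatial awareness and aligns only the text encoder

Exploits the discrete nature of text and linguistic knowledge to pinpoint local concepts within captions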

Uses image-caption pairs for efficient segmentation training
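
To make the idea concrete, here is a minimal sketch of how aligned text embeddings could be turned into a segmentation map. The source does not describe the inference procedure, so this assumes the common approach of comparing each patch feature from the frozen vision model with the text embedding of every class name and keeping the best match. The function names and the toy 2-D vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def label_patches(patch_features, text_embeddings):
    """Assign each patch the class whose text embedding is most similar.

    patch_features: grid (list of rows) of patch vectors from a frozen
    vision encoder. text_embeddings: dict mapping class name to a vector
    from the aligned text encoder. Both are toy stand-ins here.
    """
    names = list(text_embeddings)
    return [
        [max(names, key=lambda n: cosine(p, text_embeddings[n])) for p in row]
        for row in patch_features
    ]


text_embeddings = {"dog": [1.0, 0.0], "grass": [0.0, 1.0]}
patch_features = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[1.0, 0.3], [0.1, 1.0]],
]
segmentation = label_patches(patch_features, text_embeddings)
```

The class names are free-form text, which is what makes the vocabulary open: adding a class only requires embedding one more string, with no change to the vision model.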

🔎 Similar Papers

Auto-Vocabulary Semantic Segmentation