Mask-Adapter: The Devil is in the Masks for Open-Vocabulary Segmentation

📅 2024-12-05
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
In open-vocabulary segmentation, mask pooling distorts the semantic fidelity of CLIP image embeddings, severely limiting zero-shot classification accuracy. This work identifies and addresses this fundamental limitation via Mask-Adapter, a lightweight plug-in module. First, it replaces conventional mask pooling with mask-driven semantic activation maps to enhance region–text alignment. Second, it introduces an IoU-aware contrastive mask consistency loss that enforces embedding similarity for masks with comparable IoU values. Third, its modular design ensures seamless integration with SAM and mainstream open-set segmentation frameworks. Evaluated on zero-shot segmentation benchmarks—including COCO-Stuff and ADE20K—Mask-Adapter achieves significant improvements over state-of-the-art methods, demonstrating simultaneous gains in semantic representation robustness and generalization capability.

📝 Abstract
Recent open-vocabulary segmentation methods adopt mask generators to predict segmentation masks and leverage pre-trained vision-language models, e.g., CLIP, to classify these masks via mask pooling. Although these approaches show promising results, it is counterintuitive that accurate masks often fail to yield accurate classification results through pooling CLIP image embeddings within the mask regions. In this paper, we reveal the performance limitations of mask pooling and introduce Mask-Adapter, a simple yet effective method to address these challenges in open-vocabulary segmentation. Compared to directly using proposal masks, our proposed Mask-Adapter extracts semantic activation maps from proposal masks, providing richer contextual information and ensuring alignment between masks and CLIP. Additionally, we propose a mask consistency loss that encourages proposal masks with similar IoUs to obtain similar CLIP embeddings to enhance models' robustness to varying predicted masks. Mask-Adapter integrates seamlessly into open-vocabulary segmentation methods based on mask pooling in a plug-and-play manner, delivering more accurate classification results. Extensive experiments across several zero-shot benchmarks demonstrate significant performance gains for the proposed Mask-Adapter on several well-established methods. Notably, Mask-Adapter also extends effectively to SAM and achieves impressive results on several open-vocabulary segmentation datasets. Code and models are available at https://github.com/hustvl/MaskAdapter.
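To make the core contrast concrete, the sketch below compares conventional mask pooling (uniform averaging of CLIP dense features inside a binary mask) with a simplified, hypothetical stand-in for Mask-Adapter's semantic activation maps: a soft, saliency-weighted pooling restricted to the mask support. The `activation_map_pooling` weighting scheme is an illustrative assumption, not the paper's learned module, which predicts activation maps from the proposal mask itself.

```python
import numpy as np

def mask_pooling(features, mask):
    """Baseline: uniformly average CLIP dense features over the mask region.
    features: (H, W, D) dense image embeddings; mask: (H, W) binary."""
    w = mask / max(mask.sum(), 1e-6)
    return np.einsum("hw,hwd->d", w, features)

def activation_map_pooling(features, mask, tau=0.1):
    """Illustrative stand-in for semantic activation maps (assumption, not
    the paper's learned module): softmax-weighted pooling so high-magnitude
    feature locations inside the mask dominate instead of a uniform average."""
    saliency = np.linalg.norm(features, axis=-1)      # per-pixel feature magnitude
    logits = saliency / tau + np.log(mask + 1e-6)     # ~-inf outside the mask
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return np.einsum("hw,hwd->d", w, features)

def classify(embedding, text_embeddings):
    """Zero-shot classification: argmax cosine similarity vs. text embeddings."""
    e = embedding / np.linalg.norm(embedding)
    t = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    return int(np.argmax(t @ e))
```

Both pooling variants reduce a region to one embedding that is scored against CLIP text embeddings; the paper's point is that the uniform average discards context and misaligns with how CLIP was trained, which the learned activation maps are designed to recover.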
Problem

Research questions and friction points this paper is trying to address.

Improves classification accuracy in open-vocabulary segmentation
Addresses the limitations of mask pooling over CLIP image embeddings
Enhances the robustness of mask–CLIP alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extracts semantic activation maps from proposal masks
Introduces an IoU-aware mask consistency loss for robustness
Integrates plug-and-play with mask-pooling-based methods and SAM
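The mask consistency loss can be sketched as follows. This is a minimal illustration under a stated assumption: pairs of proposal masks whose IoUs with the same ground-truth mask are close get a high pairwise weight and are pulled toward similar embeddings (the Gaussian weighting and 1 − cosine distance are choices for this sketch, not the paper's exact formulation).

```python
import numpy as np

def iou(a, b):
    """IoU of two binary masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / max(union, 1e-6)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def mask_consistency_loss(embeddings, masks, gt_mask, sigma=0.1):
    """Sketch of an IoU-aware consistency loss (assumed form): proposal
    pairs with similar IoU to the ground truth are weighted heavily and
    penalized for embedding dissimilarity."""
    ious = np.array([iou(m, gt_mask) for m in masks])
    loss, n = 0.0, 0
    for i in range(len(masks)):
        for j in range(i + 1, len(masks)):
            # weight peaks when the two proposals' IoUs coincide
            w = np.exp(-((ious[i] - ious[j]) ** 2) / (2 * sigma ** 2))
            loss += w * (1.0 - cosine(embeddings[i], embeddings[j]))
            n += 1
    return loss / max(n, 1)
```

Identical embeddings give zero loss; dissimilar embeddings for equally-good masks are penalized most, which matches the stated goal of making classification robust to variation in predicted masks.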