🤖 AI Summary
This work addresses the limited generalization of existing open-vocabulary panoptic segmentation models to unseen categories by introducing, for the first time, a retrieval-augmented mechanism. The approach constructs a database of masked segment features and, during inference, uses masked segment features from the input image as queries to retrieve semantically similar features along with their category labels. Classification scores derived from the retrieved labels are then combined with CLIP-based scores to enable effective segmentation of any user-specified category. Integrated into the FC-CLIP framework, the method combines CLIP, mask feature extraction, cross-modal retrieval, and score fusion. Trained on COCO and evaluated on ADE20K, it achieves 30.9 PQ, 19.3 mAP, and 44.0 mIoU, an improvement of +4.5 PQ, +2.5 mAP, and +10.0 mIoU over the baseline, substantially enhancing segmentation performance on unseen categories.
📝 Abstract
Given an input image and a set of class names, panoptic segmentation aims to label each pixel with a class label and an instance label. In comparison, Open Vocabulary Panoptic Segmentation aims to facilitate the segmentation of arbitrary classes according to user input. The challenge is that a panoptic segmentation system trained on a particular dataset typically does not generalize well to unseen classes beyond the training data. In this work, we propose RetCLIP, a retrieval-augmented panoptic segmentation method that improves performance on unseen classes. In particular, we construct a masked segment feature database using paired image-text data. At inference time, we use masked segment features from the input image as query keys to retrieve similar features and their associated class labels from the database. Classification scores for the masked segment are assigned based on the similarity between query features and retrieved features. The retrieval-based classification scores are combined with CLIP-based scores to produce the final output. We integrate our solution into a previous SOTA method (FC-CLIP). When trained on COCO, the proposed method achieves 30.9 PQ, 19.3 mAP, and 44.0 mIoU on the ADE20K dataset, an absolute improvement of +4.5 PQ, +2.5 mAP, and +10.0 mIoU over the baseline.
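The sketch below illustrates the retrieval-augmented classification step described in the abstract: masked segment features query a feature database, retrieved labels are turned into similarity-weighted class scores, and those scores are fused with CLIP-based scores. This is a minimal illustration, not the paper's actual implementation; all function names, the top-k voting scheme, and the linear fusion weight are assumptions, and the FC-CLIP/CLIP components are abstracted as plain tensors.

```python
# Hypothetical sketch of retrieval-augmented classification for masked segments.
# Names (retrieval_scores, fuse_scores, top_k, alpha) are illustrative, not from the paper.
import torch
import torch.nn.functional as F


def retrieval_scores(query_feats, db_feats, db_labels, num_classes, top_k=5, temperature=0.07):
    """Classify each masked-segment feature by retrieving similar database features.

    query_feats: (Q, D) masked segment features from the input image
    db_feats:    (N, D) masked segment features built from paired image-text data
    db_labels:   (N,)   class indices associated with each database feature
    Returns:     (Q, num_classes) retrieval-based classification scores
    """
    q = F.normalize(query_feats, dim=-1)
    db = F.normalize(db_feats, dim=-1)
    sim = q @ db.t()                               # cosine similarity, shape (Q, N)
    top_sim, top_idx = sim.topk(top_k, dim=-1)     # nearest database entries per query
    weights = F.softmax(top_sim / temperature, dim=-1)
    scores = torch.zeros(query_feats.size(0), num_classes)
    # Accumulate similarity-weighted votes onto the retrieved entries' class labels.
    scores.scatter_add_(1, db_labels[top_idx], weights)
    return scores


def fuse_scores(clip_scores, retr_scores, alpha=0.5):
    # Combine CLIP-based and retrieval-based scores; a simple linear fusion is
    # assumed here, since the abstract only states that the two are combined.
    return alpha * clip_scores + (1 - alpha) * retr_scores
```

In this view, the database lookup supplies class evidence for categories that the segmentation head never saw during training, while the CLIP scores handle categories that CLIP's text embeddings already cover well; the fusion step balances the two sources.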