🤖 AI Summary
To address the challenges of unbounded category sets and scarce annotations in open-vocabulary panoptic segmentation, this paper is the first to bring the multimodal pretrained model BEiT-3 into the task, proposing an end-to-end framework built on a Vision-Language Multiway Transformer. The method combines cross-modal attention with joint panoptic decoding, allowing object detection, instance segmentation, and semantic classification to be optimized together, which markedly improves zero-shot generalization to unseen categories. On standard benchmarks, the approach consistently outperforms state-of-the-art CLIP-based methods across open-vocabulary detection, segmentation, and classification metrics. These results support BEiT-3's effectiveness for fine-grained vision-language alignment and open-set understanding in panoptic segmentation.
📝 Abstract
Open-vocabulary panoptic segmentation remains a challenging problem. One of the biggest difficulties lies in training models to generalize to an unlimited number of classes using limited categorized training data. Recent popular methods build on large-scale vision-language pre-trained foundation models such as CLIP. In this paper, we propose OMTSeg for open-vocabulary segmentation, which instead uses another large-scale vision-language pre-trained model, BEiT-3, and leverages the cross-modal attention between visual and linguistic features in BEiT-3 to achieve better performance. Experimental results demonstrate that OMTSeg performs favorably against state-of-the-art models.
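The cross-modal attention mentioned above can be illustrated with a minimal sketch: visual patch tokens attend over text-prompt tokens so that image features become conditioned on class names. This is an illustrative single-head attention with no learned projections, not the actual BEiT-3 / OMTSeg implementation; all shapes and names here are assumptions for demonstration.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(visual, text):
    """Visual tokens attend to text tokens (single head, no projections).

    visual: (Nv, d) patch-token features
    text:   (Nt, d) text-token features
    returns (Nv, d) text-conditioned visual features
    """
    d = visual.shape[-1]
    scores = visual @ text.T / np.sqrt(d)   # (Nv, Nt) scaled similarities
    weights = softmax(scores, axis=-1)      # per visual token: distribution over text tokens
    return weights @ text                   # weighted mix of text features

# toy example: 14x14 = 196 patch tokens, 12 prompt tokens, dim 64 (assumed sizes)
rng = np.random.default_rng(0)
vis = rng.normal(size=(196, 64))
txt = rng.normal(size=(12, 64))
out = cross_attention(vis, txt)
assert out.shape == (196, 64)
```

In a Multiway Transformer, blocks like this (with learned query/key/value projections and modality-specific expert FFNs) let linguistic context flow into the visual stream, which is what the open-vocabulary decoder exploits to classify masks for unseen category names.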