🤖 AI Summary
To address the challenges of unbounded category sets and scarce annotations in open-vocabulary panoptic segmentation, this paper is the first to bring the multimodal pretrained model BEiT-3 into the task, proposing an end-to-end framework built on a Vision-Language Multiway Transformer. The method combines cross-modal attention with joint panoptic decoding, allowing object detection, instance segmentation, and semantic classification to be optimized together, which markedly improves zero-shot generalization to unseen categories. On standard benchmarks, the approach consistently outperforms state-of-the-art CLIP-based methods across open-vocabulary detection, segmentation, and classification metrics. These results support BEiT-3's effectiveness for fine-grained vision-language alignment and open-set understanding in panoptic segmentation.
📝 Abstract
Open-vocabulary panoptic segmentation remains a challenging problem. One of the biggest difficulties lies in training models to generalize to an unlimited number of classes using limited categorized training data. Recent popular methods build on large-scale vision-language pre-trained foundation models such as CLIP. In this paper, we propose OMTSeg for open-vocabulary segmentation, which instead uses another large-scale vision-language pre-trained model, BEiT-3, and leverages the cross-modal attention between visual and linguistic features in BEiT-3 to achieve better performance. Experimental results demonstrate that OMTSeg performs favorably against state-of-the-art models.
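The cross-modal attention mentioned above can be illustrated with a minimal sketch: visual patch tokens attend over text-prompt tokens so that image features become conditioned on class names. This is an illustrative single-head attention with no learned projections, not the actual BEiT-3 / OMTSeg implementation; all shapes and names here are assumptions for demonstration.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(visual, text):
    """Visual tokens attend to text tokens (single head, no projections).

    visual: (Nv, d) patch-token features
    text:   (Nt, d) text-token features
    returns (Nv, d) text-conditioned visual features
    """
    d = visual.shape[-1]
    scores = visual @ text.T / np.sqrt(d)   # (Nv, Nt) scaled similarities
    weights = softmax(scores, axis=-1)      # per visual token: distribution over text tokens
    return weights @ text                   # weighted mix of text features

# toy example: 14x14 = 196 patch tokens, 12 prompt tokens, dim 64 (assumed sizes)
rng = np.random.default_rng(0)
vis = rng.normal(size=(196, 64))
txt = rng.normal(size=(12, 64))
out = cross_attention(vis, txt)
assert out.shape == (196, 64)
```

In a Multiway Transformer, blocks like this (with learned query/key/value projections and modality-specific expert FFNs) let linguistic context flow into the visual stream, which is what the open-vocabulary decoder exploits to classify masks for unseen category names.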