Open-Vocabulary Panoptic Segmentation Using BERT Pre-Training of Vision-Language Multiway Transformer Model

📅 2024-10-27
🏛️ IEEE International Conference on Image Processing (ICIP)
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of unbounded categories and scarce annotations in open-vocabulary panoptic segmentation, this paper introduces the multimodal pretrained model BEiT-3 into the task, proposing an end-to-end framework based on a Vision-Language Multiway Transformer. The method integrates cross-modal attention with joint panoptic decoding, enabling synergistic optimization of object detection, instance segmentation, and semantic classification, which significantly enhances zero-shot generalization to unseen categories. On standard benchmarks, the approach consistently outperforms state-of-the-art CLIP-based methods across open-vocabulary detection, segmentation, and classification metrics. These results validate BEiT-3's effectiveness in fine-grained vision-language alignment and open-set understanding for panoptic segmentation.
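To make the "cross-modal attention" idea concrete, here is a minimal, illustrative sketch of visual tokens attending to text tokens, the basic operation a Multiway Transformer layer applies between modalities. This is not the authors' implementation: the function name, the identity Q/K/V projections, and the token counts are assumptions for demonstration only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(visual, text):
    """Illustrative single-head cross-attention: visual tokens query text tokens.

    visual: (N_v, d) image-patch features
    text:   (N_t, d) word-piece features (e.g. embedded class names)
    Returns (N_v, d) visual features enriched with linguistic context.
    """
    d = visual.shape[-1]
    # A real model uses learned Q/K/V projections; identity maps here.
    q, k, v = visual, text, text
    scores = q @ k.T / np.sqrt(d)        # (N_v, N_t) visual-to-text affinities
    weights = softmax(scores, axis=-1)   # per visual token: distribution over words
    return weights @ v                   # weighted linguistic context per visual token

rng = np.random.default_rng(0)
vis = rng.standard_normal((196, 64))     # e.g. 14x14 patch tokens
txt = rng.standard_normal((8, 64))       # e.g. 8 category-name tokens
out = cross_modal_attention(vis, txt)
print(out.shape)  # (196, 64)
```

In an open-vocabulary setting, the text side can be swapped for embeddings of arbitrary category names at test time, which is what lets the same attention machinery score unseen classes.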

📝 Abstract
Open-vocabulary panoptic segmentation remains a challenging problem. One of the biggest difficulties lies in training models to generalize to an unlimited number of classes using limited categorized training data. Recent popular methods involve large-scale vision-language pre-trained foundation models, such as CLIP. In this paper, we propose OMTSeg for open-vocabulary segmentation using another large-scale vision-language pre-trained model called BEiT-3 and leveraging the cross-modal attention between visual and linguistic features in BEiT-3 to achieve better performance. Experimental results demonstrate that OMTSeg performs favorably against state-of-the-art models.
Problem

Research questions and friction points this paper is trying to address.

Panoptic Segmentation
Multi-object Recognition
Limited Training Data
Innovation

Methods, ideas, or system contributions that make the work stand out.

OMTSeg
BEiT-3 Pre-trained Model
Multiway Transformer Architecture
Yi-Chia Chen
National Taiwan University
Wei-Hua Li
National Taiwan University
Chu-Song Chen
National Taiwan University
deep learning · pattern recognition · computer vision · image processing · multimedia