🤖 AI Summary
Problem: Existing panoptic segmentation methods rely heavily on task-specific components, which limits generalization and makes it hard to transfer large-scale pre-trained vision models effectively.
Method: This paper proposes a generalist end-to-end framework with a deep-encoder, shallow-decoder architecture and per-pixel prediction, which amounts to fine-tuning a massively pre-trained vision model with minimal additional components. Because naive fine-tuning performs poorly due to imbalance during training, the paper introduces centroid regression in the space of spectral positional embeddings to reduce that imbalance.
Contribution/Results: The approach attains a Panoptic Quality (PQ) of 55.1 on MS-COCO, state-of-the-art among generalist methods. It shows that a single, simple framework without specialized components can transfer a large pre-trained vision model to panoptic segmentation.
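For readers unfamiliar with the metric, the sketch below shows how Panoptic Quality is commonly computed for a single class: predicted and ground-truth segments count as matched when their IoU exceeds 0.5, and PQ is the sum of matched IoUs divided by |TP| + 0.5|FP| + 0.5|FN|. The function name and arguments are illustrative and not taken from the paper; the reported dataset-level PQ is an average of this quantity over classes.

```python
def panoptic_quality(matched_ious, num_false_positives, num_false_negatives):
    """Per-class PQ from already-matched segments (illustrative helper).

    matched_ious: IoU of each predicted/ground-truth pair with IoU > 0.5,
                  i.e. one entry per true positive.
    num_false_positives: predicted segments with no match.
    num_false_negatives: ground-truth segments with no match.
    """
    num_true_positives = len(matched_ious)
    denominator = (
        num_true_positives
        + 0.5 * num_false_positives
        + 0.5 * num_false_negatives
    )
    if denominator == 0:
        # No segments of this class in prediction or ground truth.
        return 0.0
    return sum(matched_ious) / denominator


# Three matched segments, one spurious prediction, one missed segment.
pq = panoptic_quality([0.9, 0.8, 0.7], num_false_positives=1, num_false_negatives=1)
print(round(pq, 3))
```

The numerator rewards accurate masks, while the denominator penalizes both spurious and missed segments, so a score on the 0 to 100 scale such as 55.1 reflects segmentation and recognition quality together.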
📝 Abstract
Panoptic segmentation is an important computer vision task for which current state-of-the-art solutions require specialized components to perform well. We propose a simple generalist framework based on a deep-encoder, shallow-decoder architecture with per-pixel prediction, which essentially amounts to fine-tuning a massively pretrained image model with minimal additional components. Applied naively, this method does not yield good results. We show that this is due to imbalance during training and propose a novel method for reducing it: centroid regression in the space of spectral positional embeddings. Our method achieves a panoptic quality (PQ) of 55.1 on the challenging MS-COCO dataset, state-of-the-art performance among generalist methods.
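The abstract does not spell out how the regression targets are built, so the following is only a rough sketch of one plausible reading, not the authors' implementation. It assumes that the spectral positional embedding of a pixel is a vector of low-frequency cosine modes over the image grid, and that each pixel's regression target is the mean embedding of the instance it belongs to. The function names, the choice of cosine basis, and the `num_freqs` parameter are all hypothetical.

```python
import math


def spectral_embedding(height, width, num_freqs=2):
    """Per-pixel embedding from low-frequency cosine modes.

    Assumption: a cosine basis stands in for the paper's spectral
    positional embeddings. Returns a nested list indexed [y][x][channel].
    """
    embedding = []
    for y in range(height):
        row = []
        for x in range(width):
            features = []
            for k in range(1, num_freqs + 1):
                features.append(math.cos(math.pi * k * (x + 0.5) / width))
                features.append(math.cos(math.pi * k * (y + 0.5) / height))
            row.append(features)
        embedding.append(row)
    return embedding


def centroid_targets(instance_ids, embedding):
    """Per-pixel target: mean embedding over the pixel's instance."""
    sums, counts = {}, {}
    for y, id_row in enumerate(instance_ids):
        for x, inst in enumerate(id_row):
            vec = embedding[y][x]
            if inst not in sums:
                sums[inst] = [0.0] * len(vec)
                counts[inst] = 0
            sums[inst] = [s + v for s, v in zip(sums[inst], vec)]
            counts[inst] += 1
    centroids = {
        inst: [s / counts[inst] for s in total] for inst, total in sums.items()
    }
    # Every pixel of an instance shares the same target vector.
    return [[centroids[inst] for inst in id_row] for id_row in instance_ids]


# Toy 2x4 image with two instances side by side.
instance_ids = [[1, 1, 2, 2],
                [1, 1, 2, 2]]
embedding = spectral_embedding(height=2, width=4, num_freqs=1)
targets = centroid_targets(instance_ids, embedding)
```

Under these assumptions, a per-pixel regression head would be trained to predict `targets`, and pixels could then be grouped into instances by clustering their predicted vectors. How the paper actually defines the embeddings, the loss, and the grouping step is not stated in the abstract.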