🤖 AI Summary
This paper addresses unsupervised panoptic segmentation in complex urban scenes, proposing the first scene-centric approach that eliminates reliance on object-centric priors and manual annotations. Methodologically, it fuses multi-modal cues (RGB appearance, estimated depth, and optical flow) to generate high-resolution panoptic pseudo-labels and introduces a two-stage panoptic self-training framework. Key technical contributions include: (1) cross-modal pseudo-label generation guided jointly by depth estimation and motion cues; and (2) a panoptic self-training strategy enforcing consistency across both semantic and instance segmentation. Evaluated on Cityscapes, the method achieves an unsupervised Panoptic Quality (PQ) of 32.1%, surpassing the prior state of the art by 9.4 PQ points, a substantial advance for unsupervised panoptic segmentation.
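The cross-modal pseudo-label generation can be pictured as fusing a per-pixel semantic pseudo-label map with instance masks obtained from depth and motion cues. The following minimal sketch illustrates one such fusion step; the function name `fuse_panoptic_pseudo_labels`, the majority-vote class assignment, and the `class_id * 1000 + instance_id` encoding are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def fuse_panoptic_pseudo_labels(semantic, instance_masks, thing_classes, label_divisor=1000):
    """Fuse a semantic pseudo-label map with instance masks (e.g. from grouping
    depth/optical-flow cues) into a single panoptic pseudo-label map.

    semantic:       (H, W) int array of semantic class ids.
    instance_masks: list of (H, W) bool arrays, one per object proposal.
    thing_classes:  set of class ids treated as countable "things".
    Encoding: panoptic_id = class_id * label_divisor + instance_id
              (instance_id = 0 for "stuff" regions).
    """
    panoptic = semantic.astype(np.int64) * label_divisor
    next_instance = 1
    for mask in instance_masks:
        # Assign each proposal the majority semantic class under its mask.
        classes, counts = np.unique(semantic[mask], return_counts=True)
        if len(classes) == 0:
            continue
        cls = int(classes[np.argmax(counts)])
        if cls not in thing_classes:
            continue  # only "thing" classes receive instance ids
        panoptic[mask] = cls * label_divisor + next_instance
        next_instance += 1
    return panoptic

if __name__ == "__main__":
    H, W = 4, 6
    semantic = np.zeros((H, W), dtype=np.int64)   # class 0 = road ("stuff")
    semantic[:, 3:] = 11                          # class 11 = car ("thing")
    car_mask = np.zeros((H, W), dtype=bool)
    car_mask[:, 3:] = True                        # one motion-segmented object
    panoptic = fuse_panoptic_pseudo_labels(semantic, [car_mask], thing_classes={11})
    print(np.unique(panoptic))                    # [0, 11001]
```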
📝 Abstract
Unsupervised panoptic segmentation aims to partition an image into semantically meaningful regions and distinct object instances without training on manually annotated data. In contrast to prior work on unsupervised panoptic scene understanding, we eliminate the need for object-centric training data, enabling the unsupervised understanding of complex scenes. To that end, we present the first unsupervised panoptic method that directly trains on scene-centric imagery. In particular, we propose an approach to obtain high-resolution panoptic pseudo-labels on complex scene-centric data, combining visual representations, depth, and motion cues. Utilizing both pseudo-label training and a panoptic self-training strategy yields a novel approach that accurately predicts panoptic segmentation of complex scenes without requiring any human annotations. Our approach significantly improves panoptic quality, e.g., surpassing the recent state of the art in unsupervised panoptic segmentation on Cityscapes by 9.4 percentage points in PQ.
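To make the self-training idea concrete, below is a minimal, hypothetical sketch in PyTorch: a frozen teacher produces pseudo-labels on unlabeled scene-centric images, and a student is trained on the confident ones. Only the semantic branch is shown; `TinyPanopticNet`, the confidence threshold, and the plain cross-entropy loss are placeholder assumptions, not the paper's actual architecture or training recipe.

```python
import torch
import torch.nn.functional as F

class TinyPanopticNet(torch.nn.Module):
    """Stand-in for a panoptic model with semantic and instance heads."""
    def __init__(self, num_classes=19, embed_dim=8):
        super().__init__()
        self.backbone = torch.nn.Conv2d(3, 16, 3, padding=1)
        self.sem_head = torch.nn.Conv2d(16, num_classes, 1)
        self.ins_head = torch.nn.Conv2d(16, embed_dim, 1)

    def forward(self, x):
        feat = torch.relu(self.backbone(x))
        return self.sem_head(feat), self.ins_head(feat)

def self_training_step(student, teacher, images, conf_thresh=0.9):
    """One self-training step: the frozen teacher produces pseudo-labels,
    and the student is supervised on the confident pixels. Confidence
    filtering and loss choice here are illustrative assumptions."""
    with torch.no_grad():
        sem_logits_t, _ = teacher(images)
        probs = sem_logits_t.softmax(dim=1)
        conf, pseudo_sem = probs.max(dim=1)
        pseudo_sem[conf < conf_thresh] = 255   # 255 = ignore index

    sem_logits_s, _ = student(images)
    return F.cross_entropy(sem_logits_s, pseudo_sem, ignore_index=255)

if __name__ == "__main__":
    student, teacher = TinyPanopticNet(), TinyPanopticNet()
    teacher.eval()
    opt = torch.optim.Adam(student.parameters(), lr=1e-4)
    images = torch.rand(2, 3, 64, 128)         # stand-in for scene-centric crops
    # conf_thresh=0.0 keeps all pixels in this smoke test (random teacher is unsure).
    loss = self_training_step(student, teacher, images, conf_thresh=0.0)
    loss.backward()
    opt.step()
    print(float(loss))
```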