OmniSAM: Omnidirectional Segment Anything Model for UDA in Panoramic Semantic Segmentation

📅 2025-03-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of adapting SAM2 to panoramic semantic segmentation, where significant field-of-view (FoV) discrepancies—between pinhole (70°×70°) and 360° spherical (180°×360°) perspectives—induce severe geometric distortion, object deformation, and pixel-level semantic ambiguity. We present the first extension of SAM2 to spherical vision, introducing a patch-wise sequential modeling framework that leverages SAM2’s memory mechanism to capture cross-FoV dependencies. Our method incorporates a FoV-aware prototype adaptation module and a dynamic pseudo-label updating strategy to align memory features with backbone representations. Key technical components include video-style patch processing, memory-enhanced inter-patch matching, fine-tuning of the image encoder while reusing the mask decoder, and prototype-based contrastive learning. In unsupervised domain adaptation on SPin8→SPan8 and CS13→DP13 benchmarks, our approach achieves mIoU scores of 79.06% (+10.22%) and 62.46% (+6.58%), respectively, significantly outperforming state-of-the-art methods.

📝 Abstract
Segment Anything Model 2 (SAM2) has emerged as a strong base model for various pinhole imaging segmentation tasks. However, when applying it to the $360^\circ$ domain, the significant field-of-view (FoV) gap between pinhole ($70^\circ \times 70^\circ$) and panoramic images ($180^\circ \times 360^\circ$) poses unique challenges. Two major concerns for this application include: 1) inevitable distortion and object deformation brought by the large FoV disparity between domains; 2) the lack of pixel-level semantic understanding, which the original SAM2 cannot provide. To address these issues, we propose a novel OmniSAM framework, which makes the first attempt to apply SAM2 to panoramic semantic segmentation. Specifically, to bridge the first gap, OmniSAM first divides the panorama into sequences of patches. These patches are then treated as image sequences, in a manner similar to video segmentation tasks. We then leverage SAM2's memory mechanism to extract cross-patch correspondences that embed cross-FoV dependencies, improving feature continuity and prediction consistency along mask boundaries. For the second gap, OmniSAM fine-tunes the pretrained image encoder and reuses the mask decoder for semantic prediction. An FoV-based prototypical adaptation module with a dynamic pseudo-label update mechanism is also introduced to facilitate the alignment of memory and backbone features, thereby improving the model's generalization ability across different sizes of source models. Extensive experimental results demonstrate that OmniSAM outperforms state-of-the-art methods by large margins, e.g., 79.06% (+10.22%) on SPin8-to-SPan8 and 62.46% (+6.58%) on CS13-to-DP13.
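The patch-wise sequential modeling described in the abstract can be sketched as below. This is a minimal illustration only: the helper name, patch count, and overlap ratio are assumptions for the sketch, not the paper's exact configuration.

```python
import numpy as np

def panorama_to_patch_sequence(pano, num_patches=8, overlap=0.25):
    """Slice an equirectangular panorama (H x W x C) into an ordered
    sequence of horizontally overlapping patches, wrapping around the
    360-degree seam so the last patch connects back to the first.
    The sequence can then be fed to a video-style model frame by frame."""
    h, w, c = pano.shape
    stride = w // num_patches                      # horizontal step per patch
    patch_w = int(stride * (1 + overlap))          # widen each patch by the overlap
    # Pad on the right with the left edge to handle the circular wrap-around.
    wrapped = np.concatenate([pano, pano[:, :patch_w]], axis=1)
    patches = [wrapped[:, i * stride : i * stride + patch_w]
               for i in range(num_patches)]
    return np.stack(patches)                       # (num_patches, H, patch_w, C)

# Example: a 512x2048 RGB panorama becomes 8 ordered "frames".
pano = np.zeros((512, 2048, 3), dtype=np.uint8)
seq = panorama_to_patch_sequence(pano)
print(seq.shape)  # (8, 512, 320, 3)
```

Because adjacent patches overlap and the last patch wraps back to the first, a memory mechanism that conditions each patch on its predecessors can propagate features across the full 360° sweep.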
Problem

Research questions and friction points this paper is trying to address.

Addresses distortion in panoramic image segmentation.
Enhances semantic understanding in large FoV images.
Improves model generalization across different source models.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Divides the panorama into patch sequences for segmentation.
Uses SAM2's memory mechanism for cross-patch feature continuity.
Fine-tunes the image encoder and reuses the mask decoder for semantic prediction.