🤖 AI Summary
This work addresses the challenges in panoramic scene understanding—namely, the scarcity of high-resolution multi-task annotations, severe geometric distortions, and inadequate modeling of inter-task relationships in spherical space. To overcome these limitations, the authors propose a label-free training framework that leverages off-the-shelf foundation models in the perspective domain to generate pseudo-labels. They introduce the Panoramic Dual BridgeNet, an architecture that integrates rotation-invariant and rotation-variant task streams through geometry-aware modulation and an ERP token mixer. Furthermore, geometric priors and gradient truncation strategies are employed to decouple multi-task information. The method achieves state-of-the-art performance across multiple panoramic benchmarks, rivaling task-specific panoramic foundation models.
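The gradient truncation mentioned above can be illustrated with a toy scalar example. This is only a sketch of the general stop-gradient idea, not the paper's implementation; the two "branches", the fusion form, and all names here are hypothetical. When branch B consumes a feature from branch A, truncation zeroes the gradient flowing back along that cross-branch path, so B's loss cannot disturb A's parameters:

```python
def branch_grads(wa, wb, x, truncate):
    """Toy two-branch model: a = wa*x (branch A), b = wb*x (branch B).
    Branch B fuses in A's feature; its loss is (b + 0.5*a)^2.
    With truncate=True, the backward pass drops the gradient that
    would flow from B's loss into branch A (a stop-gradient)."""
    a = wa * x                      # branch A feature
    b = wb * x                      # branch B feature
    fused = b + 0.5 * a             # cross-branch fusion
    loss = fused ** 2

    # Manual backward pass for this scalar graph.
    dfused = 2.0 * fused
    db = dfused
    da = 0.0 if truncate else 0.5 * dfused  # gradient truncation point
    grad_wb = db * x
    grad_wa = da * x
    return loss, grad_wa, grad_wb
```

With truncation on, branch A still feeds the fusion in the forward pass, but receives zero gradient from B's loss; branch B's own gradient is unchanged either way.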
📝 Abstract
Comprehensive panoramic scene understanding is critical for immersive applications, yet it remains challenging due to the scarcity of high-resolution, multi-task annotations. While perspective foundation models have achieved success through data scaling, directly adapting them to the panoramic domain often fails due to severe geometric distortions and coordinate system discrepancies. Furthermore, the underlying relations between diverse dense prediction tasks in spherical spaces are underexplored. To address these challenges, we propose MTPano, a robust multi-task panoramic foundation model trained via a label-free pipeline. First, to circumvent data scarcity, we leverage powerful perspective dense priors. We project panoramic images into perspective patches to generate accurate, domain-gap-free pseudo-labels using off-the-shelf foundation models, which are then re-projected to serve as patch-wise supervision. Second, to tackle the interference among task types, we categorize tasks into rotation-invariant (e.g., depth, segmentation) and rotation-variant (e.g., surface normals) groups. We introduce the Panoramic Dual BridgeNet, which disentangles these feature streams via geometry-aware modulation layers that inject absolute position and ray direction priors. To handle the distortion of the equirectangular projection (ERP), we incorporate ERP token mixers followed by a dual-branch BridgeNet whose cross-branch interactions use gradient truncation, facilitating beneficial cross-task information sharing while blocking conflicting gradients from incompatible task attributes. Additionally, we introduce auxiliary tasks (image gradient, point map, etc.) to enrich cross-task learning. Extensive experiments demonstrate that MTPano achieves state-of-the-art performance on multiple benchmarks and delivers competitive results against task-specific panoramic foundation models.
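The patch-cutting step behind the pseudo-labeling pipeline can be sketched as an inverse gnomonic projection: for each pixel of a perspective patch, compute where to sample the equirectangular panorama. This is a generic geometric sketch, not the paper's code; the camera convention (z forward, y down) and the function name are assumptions:

```python
import numpy as np

def perspective_to_erp_coords(h, w, fov_deg, lon0, lat0, erp_h, erp_w):
    """For each pixel of an h x w perspective patch with the given
    horizontal FOV, centered at (lon0, lat0) in radians on the sphere,
    return (row, col) arrays giving the equirectangular pixel to sample
    from an erp_h x erp_w panorama (inverse gnomonic projection)."""
    f = 0.5 * w / np.tan(0.5 * np.radians(fov_deg))  # pinhole focal length
    xs = np.arange(w) - (w - 1) / 2.0
    ys = np.arange(h) - (h - 1) / 2.0
    x, y = np.meshgrid(xs, ys)
    # Ray directions in the camera frame (z forward, y down, x right).
    d = np.stack([x, y, np.full_like(x, f)], axis=-1)
    d /= np.linalg.norm(d, axis=-1, keepdims=True)
    # Rotate the camera's forward axis to (lon0, lat0): pitch, then yaw.
    cl, sl = np.cos(lat0), np.sin(lat0)
    co, so = np.cos(lon0), np.sin(lon0)
    Rx = np.array([[1, 0, 0], [0, cl, -sl], [0, sl, cl]])  # pitch
    Ry = np.array([[co, 0, so], [0, 1, 0], [-so, 0, co]])  # yaw
    d = d @ (Ry @ Rx).T
    lon = np.arctan2(d[..., 0], d[..., 2])       # [-pi, pi]
    lat = np.arcsin(np.clip(d[..., 1], -1, 1))   # [-pi/2, pi/2]
    col = (lon / np.pi + 1.0) * 0.5 * (erp_w - 1)
    row = (lat / (np.pi / 2) + 1.0) * 0.5 * (erp_h - 1)
    return row, col
```

Sampling the ERP image (or a dense pseudo-label) at these coordinates yields a distortion-free perspective patch for the off-the-shelf model; the same coordinate map, inverted, scatters the model's prediction back as patch-wise supervision on the panorama.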