🤖 AI Summary
Existing affordance grounding methods are constrained by perspective views and object-centric modeling, limiting their ability to support global perception for embodied agents in panoramic indoor environments. To address this gap, this work introduces the first scene-level panoramic affordance grounding task and presents 360-AGD, the first high-quality panoramic affordance dataset. The authors propose PanoAffordanceNet, an end-to-end network featuring a Distortion-Aware Spectral Modulator (DASM) to correct equirectangular projection distortions and an Omni-Spherical Densification Head (OSDH) to restore spherical topological continuity. The framework incorporates multi-level constraints spanning pixel-wise, distribution-level, and region-text contrastive learning. Experiments demonstrate that the approach significantly outperforms existing methods on 360-AGD, establishing a new baseline for scene-level perception in embodied intelligence.
📝 Abstract
Global perception is essential for embodied agents in 360° spaces, yet current affordance grounding remains largely object-centric and restricted to perspective views. To bridge this gap, we introduce a novel task: Holistic Affordance Grounding in 360° Indoor Environments. This task faces unique challenges, including severe geometric distortions from Equirectangular Projection (ERP), semantic dispersion, and cross-scale alignment difficulties. We propose PanoAffordanceNet, an end-to-end framework featuring a Distortion-Aware Spectral Modulator (DASM) for latitude-dependent calibration and an Omni-Spherical Densification Head (OSDH) to restore topological continuity from sparse activations. By integrating multi-level constraints comprising pixel-wise, distributional, and region-text contrastive objectives, our framework effectively suppresses semantic drift under low supervision. Furthermore, we construct 360-AGD, the first high-quality panoramic affordance grounding dataset. Extensive experiments demonstrate that PanoAffordanceNet significantly outperforms existing methods, establishing a solid baseline for scene-level perception in embodied intelligence. The source code and benchmark dataset will be made publicly available at https://github.com/GL-ZHU925/PanoAffordanceNet.
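The latitude-dependent distortion the abstract attributes to Equirectangular Projection (ERP) is a geometric property of the projection itself: each pixel row maps to a latitude band, and the true spherical area of a pixel shrinks by the cosine of its latitude, so polar rows are heavily over-represented relative to equator rows. The sketch below illustrates this per-row weighting; it is a minimal illustration of the distortion being corrected, not the paper's DASM implementation, and the function name `erp_row_weights` is our own.

```python
import numpy as np

def erp_row_weights(height: int) -> np.ndarray:
    """Per-row solid-angle weights for an equirectangular (ERP) image.

    In ERP, row i spans a latitude band; the spherical area covered by a
    pixel in that row scales with cos(latitude), so rows near the poles
    contribute far less true surface area than rows near the equator.
    """
    # Latitude of each row center, from +pi/2 (top) down to -pi/2 (bottom).
    lat = (0.5 - (np.arange(height) + 0.5) / height) * np.pi
    return np.cos(lat)

weights = erp_row_weights(512)
# Equator rows weigh ~1 while polar rows approach 0: any pixel-wise loss
# or feature statistic that ignores this over-counts the stretched poles.
```

This is why latitude-aware calibration (as in DASM) matters: uniform pixel-wise treatment of an ERP image implicitly over-weights the most distorted regions.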