Panoramic Affordance Prediction

πŸ“… 2026-03-16
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the limitations of existing affordance prediction methods, which are constrained by the pinhole camera model and unable to capture panoramic context. We introduce the first panoramic affordance prediction task, accompanied by PAP-12Kβ€”a large-scale, high-resolution datasetβ€”and propose PAP, a training-free coarse-to-fine inference framework. Inspired by human foveated vision, PAP integrates recursive visual routing and an adaptive gaze mechanism, combined with grid-based prompting, distortion correction, and cascaded mask generation to effectively handle the extreme resolution and geometric distortions inherent in panoramic images. Experiments demonstrate that PAP significantly outperforms existing approaches on PAP-12K, while conventional methods suffer severe performance degradation, underscoring the critical importance of panoramic perception for embodied intelligence.

πŸ“ Abstract
Affordance prediction serves as a critical bridge between perception and action in embodied AI. However, existing research is confined to pinhole camera models, which suffer from narrow Fields of View (FoV) and fragmented observations, often missing critical holistic environmental context. In this paper, we present the first exploration into Panoramic Affordance Prediction, utilizing 360-degree imagery to capture global spatial relationships and holistic scene understanding. To facilitate this novel task, we first introduce PAP-12K, a large-scale benchmark dataset containing over 1,000 ultra-high-resolution (12K, 11904 × 5952) panoramic images with over 12k carefully annotated QA pairs and affordance masks. Furthermore, we propose PAP, a training-free, coarse-to-fine pipeline inspired by the human foveal visual system to tackle the ultra-high resolution and severe distortion inherent in panoramic images. PAP employs recursive visual routing via grid prompting to progressively locate targets, applies an adaptive gaze mechanism to rectify local geometric distortions, and utilizes a cascaded grounding pipeline to extract precise instance-level masks. Experimental results on PAP-12K reveal that existing affordance prediction methods designed for standard perspective images suffer severe performance degradation and fail due to the unique challenges of panoramic vision. In contrast, the PAP framework effectively overcomes these obstacles, significantly outperforming state-of-the-art baselines and highlighting the immense potential of panoramic perception for robust embodied intelligence.
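The recursive-visual-routing step of the coarse-to-fine pipeline can be illustrated with a minimal sketch. Note this is an assumption-laden toy, not the authors' implementation: the `score_fn` below is a stand-in for the VLM grid-prompting step, and the grid size, stopping threshold, and region representation are hypothetical choices.

```python
# Toy sketch of coarse-to-fine recursive visual routing on a panorama:
# split the current region into a grid, score each cell (a placeholder for
# the paper's VLM grid prompting), and recurse into the best cell until the
# region is small enough to hand off to distortion correction and masking.

def route(region, score_fn, grid=3, min_size=256):
    """Recursively narrow an (x, y, w, h) region to its most relevant cell."""
    x, y, w, h = region
    if w <= min_size or h <= min_size:
        return region  # fine enough: downstream stages take over here
    cw, ch = w // grid, h // grid
    cells = [(x + i * cw, y + j * ch, cw, ch)
             for j in range(grid) for i in range(grid)]
    best = max(cells, key=score_fn)  # the VLM would pick this cell
    return route(best, score_fn, grid, min_size)

# Usage with a dummy scorer: rank cells by proximity to a known target point
# (in the real pipeline, relevance comes from the language query instead).
target = (9000, 4000)

def toy_score(cell):
    cx, cy = cell[0] + cell[2] / 2, cell[1] + cell[3] / 2
    return -((cx - target[0]) ** 2 + (cy - target[1]) ** 2)

# Start from the full 11904 x 5952 panorama resolution used by PAP-12K.
found = route((0, 0, 11904, 5952), toy_score)
```

The recursion touches only `grid * grid` cells per level, so the cost grows logarithmically with resolution rather than linearly with pixel count, which is what makes this kind of routing tractable on ultra-high-resolution panoramas.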
Problem

Research questions and friction points this paper is trying to address.

Affordance Prediction
Panoramic Vision
Field of View
Embodied AI
Holistic Scene Understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Panoramic Affordance Prediction
360-degree imagery
training-free pipeline
adaptive gaze mechanism
PAP-12K dataset
πŸ”Ž Similar Papers
2024-02-20Conference on Computational Natural Language LearningCitations: 5