Panoramic Affordance Prediction

πŸ“… 2026-03-16
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the limitations of existing affordance prediction methods, which are constrained by the pinhole camera model and unable to capture panoramic context. We introduce the first panoramic affordance prediction task, accompanied by PAP-12Kβ€”a large-scale, high-resolution datasetβ€”and propose PAP, a training-free coarse-to-fine inference framework. Inspired by human foveated vision, PAP integrates recursive visual routing and an adaptive gaze mechanism, combined with grid-based prompting, distortion correction, and cascaded mask generation to effectively handle the extreme resolution and geometric distortions inherent in panoramic images. Experiments demonstrate that PAP significantly outperforms existing approaches on PAP-12K, while conventional methods suffer severe performance degradation, underscoring the critical importance of panoramic perception for embodied intelligence.

πŸ“ Abstract
Affordance prediction serves as a critical bridge between perception and action in embodied AI. However, existing research is confined to pinhole camera models, which suffer from narrow Fields of View (FoV) and fragmented observations, often missing critical holistic environmental context. In this paper, we present the first exploration into Panoramic Affordance Prediction, utilizing 360-degree imagery to capture global spatial relationships and holistic scene understanding. To facilitate this novel task, we first introduce PAP-12K, a large-scale benchmark dataset containing over 1,000 ultra-high-resolution (12K, 11904 × 5952) panoramic images with over 12k carefully annotated QA pairs and affordance masks. Furthermore, we propose PAP, a training-free, coarse-to-fine pipeline inspired by the human foveal visual system to tackle the ultra-high resolution and severe distortion inherent in panoramic images. PAP employs recursive visual routing via grid prompting to progressively locate targets, applies an adaptive gaze mechanism to rectify local geometric distortions, and utilizes a cascaded grounding pipeline to extract precise instance-level masks. Experimental results on PAP-12K reveal that existing affordance prediction methods designed for standard perspective images suffer severe performance degradation and fail due to the unique challenges of panoramic vision. In contrast, the PAP framework effectively overcomes these obstacles, significantly outperforming state-of-the-art baselines and highlighting the immense potential of panoramic perception for robust embodied intelligence.
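The recursive-visual-routing step of the coarse-to-fine pipeline can be illustrated with a minimal sketch. Note this is an assumption-laden toy, not the authors' implementation: the `score_fn` below is a stand-in for the VLM grid-prompting step, and the grid size, stopping threshold, and region representation are hypothetical choices.

```python
# Toy sketch of coarse-to-fine recursive visual routing on a panorama:
# split the current region into a grid, score each cell (a placeholder for
# the paper's VLM grid prompting), and recurse into the best cell until the
# region is small enough to hand off to distortion correction and masking.

def route(region, score_fn, grid=3, min_size=256):
    """Recursively narrow an (x, y, w, h) region to its most relevant cell."""
    x, y, w, h = region
    if w <= min_size or h <= min_size:
        return region  # fine enough: downstream stages take over here
    cw, ch = w // grid, h // grid
    cells = [(x + i * cw, y + j * ch, cw, ch)
             for j in range(grid) for i in range(grid)]
    best = max(cells, key=score_fn)  # the VLM would pick this cell
    return route(best, score_fn, grid, min_size)

# Usage with a dummy scorer: rank cells by proximity to a known target point
# (in the real pipeline, relevance comes from the language query instead).
target = (9000, 4000)

def toy_score(cell):
    cx, cy = cell[0] + cell[2] / 2, cell[1] + cell[3] / 2
    return -((cx - target[0]) ** 2 + (cy - target[1]) ** 2)

# Start from the full 11904 x 5952 panorama resolution used by PAP-12K.
found = route((0, 0, 11904, 5952), toy_score)
```

The recursion touches only `grid * grid` cells per level, so the cost grows logarithmically with resolution rather than linearly with pixel count, which is what makes this kind of routing tractable on ultra-high-resolution panoramas.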
Problem

Research questions and friction points this paper is trying to address.

Affordance Prediction
Panoramic Vision
Field of View
Embodied AI
Holistic Scene Understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Panoramic Affordance Prediction
360-degree imagery
training-free pipeline
adaptive gaze mechanism
PAP-12K dataset
πŸ”Ž Similar Papers
2024-02-20Conference on Computational Natural Language LearningCitations: 5