🤖 AI Summary
Accurately identifying object affordances in 3D scenes is challenging because object- and scene-level semantic information is difficult to integrate effectively. To address this, the work introduces AffordBridge, the first large-scale scene-level dataset, comprising 685 high-resolution indoor scenes with aligned RGB images and point-cloud annotations. The authors further propose AffordMatcher, a method that establishes cross-modal instance correspondences via visual signifiers to enable semantic matching and affordance reasoning between images and point clouds. Experimental results show that AffordMatcher significantly outperforms existing approaches on the proposed dataset, validating its effectiveness and precision in 3D affordance recognition.
📝 Abstract
Affordance learning is a complex challenge in many applications; existing approaches primarily focus on the geometric structures, visual knowledge, and affordance labels of objects to determine interactable regions. However, extending this learning capability to an entire scene is significantly more complicated, as incorporating both object- and scene-level semantics is not straightforward. In this work, we introduce AffordBridge, a large-scale dataset with 291,637 functional interaction annotations across 685 high-resolution indoor scenes in the form of point clouds. Our affordance annotations are complemented by RGB images linked to the same instances within the scenes. Building upon our dataset, we propose AffordMatcher, an affordance learning method that establishes coherent semantic correspondences between image-based and point cloud-based instances for keypoint matching, enabling more precise identification of affordance regions from visual cues, so-called signifiers. Experimental results on our dataset demonstrate the effectiveness of our approach compared with existing methods.
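To make the cross-modal correspondence idea concrete, below is a minimal, hypothetical sketch (not the authors' implementation; all function names, feature shapes, and the cosine-similarity matching rule are assumptions) of how image-based and point cloud-based instance embeddings could be matched in a shared feature space:

```python
# Hypothetical sketch of cross-modal instance matching: image and point-cloud
# instances are embedded into a shared space, and correspondences are read off
# a cosine-similarity matrix. This is NOT the AffordMatcher code.
import torch
import torch.nn.functional as F


def match_instances(img_feats: torch.Tensor, pc_feats: torch.Tensor):
    """Match image instances to point-cloud instances by cosine similarity.

    img_feats: (N, D) embeddings of N image-based instances
    pc_feats:  (M, D) embeddings of M point cloud-based instances
    Returns the (N,) index of the best point-cloud match per image instance
    and the full (N, M) similarity matrix.
    """
    img = F.normalize(img_feats, dim=-1)   # unit-normalize rows
    pc = F.normalize(pc_feats, dim=-1)
    sim = img @ pc.T                       # (N, M) cosine similarities
    return sim.argmax(dim=-1), sim


# Toy usage with random features standing in for real encoder outputs.
torch.manual_seed(0)
img_feats = torch.randn(5, 256)   # e.g., from a 2D instance encoder
pc_feats = torch.randn(8, 256)    # e.g., from a 3D point-cloud encoder
matches, sim = match_instances(img_feats, pc_feats)
print(matches)  # best point-cloud instance index per image instance
```

This toy version only shows a generic feature-matching scaffold; in the paper, the correspondence step additionally exploits visual signifiers to localize affordance regions, which this sketch does not model.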