Object Affordance Recognition and Grounding via Multi-scale Cross-modal Representation Learning

📅 2025-08-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses three key challenges in 3D object affordance understanding: task fragmentation, incomplete spatial localization, and poor scale adaptability. It proposes the first end-to-end multimodal framework that jointly models affordance classification and 3D region grounding. The method fuses RGB images and point clouds, leveraging cross-modal 3D representation learning, multi-scale geometric feature propagation, and a staged reasoning mechanism to achieve fine-grained, scale-adaptive localization of *all* potential affordance regions on an object, not merely those visible in the image. Key contributions include: (1) the first unified formulation of affordance classification and 3D grounding as a single joint learning task; (2) holistic affordance discovery beyond image-visible surfaces; and (3) elimination of fixed-scale priors. Extensive experiments on standard benchmarks demonstrate significant improvements in both classification accuracy and 3D localization precision, validating robustness and generalization in complex, real-world scenes.

📝 Abstract
A core problem of Embodied AI is to learn object manipulation from observation, as humans do. To achieve this, it is important to localize 3D object affordance areas from observations such as images (3D affordance grounding) and to understand their functionalities (affordance classification). Previous attempts usually tackle these two tasks separately, leading to inconsistent predictions because the dependency between them is not properly modeled. In addition, these methods typically ground only the incomplete affordance areas depicted in images, failing to predict the full potential affordance areas, and they operate at a fixed scale, making it difficult to cope with affordances whose scale varies significantly with respect to the whole object. To address these issues, we propose a novel approach that learns an affordance-aware 3D representation and employs a stage-wise inference strategy leveraging the dependency between the grounding and classification tasks. Specifically, we first develop a cross-modal 3D representation through efficient fusion and multi-scale geometric feature propagation, enabling inference of full potential affordance areas at a suitable regional scale. Moreover, we adopt a simple two-stage prediction mechanism that effectively couples grounding and classification for better affordance understanding. Experiments demonstrate the effectiveness of our method, showing improved performance in both affordance grounding and classification.
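
As a rough illustration of the two-stage idea described in the abstract, the PyTorch sketch below fuses a global image feature with per-point geometric features, first grounds a per-point affordance heatmap, and then classifies the affordance from features pooled over the grounded region. All module names, dimensions, and the pooling scheme are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumed PyTorch setup) of stage-wise grounding-then-classification
# with cross-modal fusion. Dimensions and heads are hypothetical.
import torch
import torch.nn as nn


class TwoStageAffordanceHead(nn.Module):
    def __init__(self, img_dim=512, pt_dim=256, hidden=256, num_classes=18):
        super().__init__()
        # Cross-modal fusion: project both modalities into a shared space.
        self.img_proj = nn.Linear(img_dim, hidden)
        self.pt_proj = nn.Linear(pt_dim, hidden)
        # Stage 1: per-point grounding head (affordance heatmap over points).
        self.ground_head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )
        # Stage 2: classification head over region-pooled features.
        self.cls_head = nn.Linear(hidden, num_classes)

    def forward(self, img_feat, pt_feats):
        # img_feat: (B, img_dim) global image feature
        # pt_feats: (B, N, pt_dim) per-point geometric features
        fused = self.pt_proj(pt_feats) + self.img_proj(img_feat).unsqueeze(1)
        # Stage 1: ground affordance regions as per-point scores in [0, 1].
        heatmap = torch.sigmoid(self.ground_head(fused)).squeeze(-1)  # (B, N)
        # Stage 2: classify from features weighted by the grounded region,
        # so the class prediction is conditioned on the grounding result.
        weights = heatmap / (heatmap.sum(dim=1, keepdim=True) + 1e-6)
        region_feat = (fused * weights.unsqueeze(-1)).sum(dim=1)      # (B, hidden)
        logits = self.cls_head(region_feat)                           # (B, num_classes)
        return heatmap, logits


# Example usage with random tensors (batch of 2, 2048 points each):
model = TwoStageAffordanceHead()
heatmap, logits = model(torch.randn(2, 512), torch.randn(2, 2048, 256))
```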
Problem

Research questions and friction points this paper is trying to address.

Localize 3D object affordance areas from images
Understand functionalities via affordance classification
Model dependency between grounding and classification tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-scale cross-modal 3D representation learning (see the sketch after this list)
Stage-wise inference for task dependency
Full potential affordance area prediction
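
A minimal sketch of the multi-scale idea from the first bullet: per-point features are pooled over neighborhoods of several sizes and concatenated, so that affordance regions of very different extents relative to the object can be captured. The neighborhood sizes and max-pooling used here are assumptions for illustration, not the paper's exact propagation scheme.

```python
# Hypothetical multi-scale geometric feature aggregation on a point cloud.
import torch


def multiscale_point_features(xyz, feats, ks=(8, 32, 128)):
    # xyz:   (B, N, 3) point coordinates
    # feats: (B, N, C) per-point features
    dists = torch.cdist(xyz, xyz)                      # (B, N, N) pairwise distances
    n, c = feats.shape[1], feats.shape[-1]
    scales = []
    for k in ks:
        # Indices of the k nearest neighbors of every point (including itself).
        idx = dists.topk(k, dim=-1, largest=False).indices          # (B, N, k)
        # Gather neighbor features and max-pool within each neighborhood.
        gathered = torch.gather(
            feats.unsqueeze(1).expand(-1, n, -1, -1),               # (B, N, N, C)
            2,
            idx.unsqueeze(-1).expand(-1, -1, -1, c),                # (B, N, k, C)
        )
        scales.append(gathered.max(dim=2).values)                   # (B, N, C)
    # Concatenate features from all scales for downstream heads.
    return torch.cat(scales, dim=-1)                                # (B, N, C * len(ks))


# Example usage: 2 clouds of 1024 points with 64-dim features each.
out = multiscale_point_features(torch.randn(2, 1024, 3), torch.randn(2, 1024, 64))
```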
Xinhang Wan
National University of Defense Technology, Changsha, China
multi-view clustering, continual learning, active learning
Dongqiang Gou
ShanghaiTech University, Shanghai, China
Xinwang Liu
National University of Defense Technology, Changsha, China
En Zhu
National University of Defense Technology, Changsha, China
Xuming He
ShanghaiTech University, Shanghai, China