🤖 AI Summary
This work addresses the challenge of aligning 3D object functional regions with natural language instructions in open-vocabulary settings by proposing a two-stage cross-modal framework. It first leverages a large language model to generate part-aware completion instructions that enrich semantic representations. It then jointly optimizes cross-object geometric consistency and intra-object semantic alignment through Affordance Prototype Aggregation (APA) and Intra-Object Relational Modeling (IORM). Evaluated on a newly constructed benchmark as well as two existing datasets, the proposed method significantly outperforms current state-of-the-art approaches, demonstrating strong effectiveness and generalization in open-vocabulary 3D affordance grounding.
📝 Abstract
Grounding natural language questions to functionally relevant regions in 3D objects -- termed language-driven 3D affordance grounding -- is essential for embodied intelligence and human-AI interaction. Existing methods, while progressing from label-based to language-driven approaches, still face challenges in open-vocabulary generalization, fine-grained geometric alignment, and part-level semantic consistency. To address these issues, we propose a novel two-stage cross-modal framework that enhances both semantic and geometric representations for open-vocabulary 3D affordance grounding. In the first stage, large language models generate part-aware instructions to recover missing semantics, enabling the model to link semantically similar affordances. In the second stage, we introduce two key components: Affordance Prototype Aggregation (APA), which captures cross-object geometric consistency for each affordance, and Intra-Object Relational Modeling (IORM), which refines geometric differentiation within objects to support precise semantic alignment. We validate our method through extensive experiments on a newly introduced benchmark as well as two existing benchmarks, demonstrating superior performance compared with existing methods.
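
To make the APA idea concrete, the following is a minimal illustrative sketch (not the paper's actual implementation) of aggregating point features from multiple objects that share one affordance label into a single cross-object prototype via similarity-weighted pooling. The function name, the feature shapes, and the weighting scheme are all assumptions for illustration:

```python
import numpy as np

def affordance_prototype(features: np.ndarray) -> np.ndarray:
    """Hypothetical cross-object prototype for one affordance label.

    features: (N, D) array of point/region features pooled from several
    objects that share the same affordance (e.g. "grasp").
    Points whose features agree with the cross-object mean are weighted
    higher, yielding a geometry-consistent prototype vector of shape (D,).
    """
    mean = features.mean(axis=0)
    # Cosine similarity of each feature to the mean (epsilon avoids div-by-zero).
    sims = features @ mean / (
        np.linalg.norm(features, axis=1) * np.linalg.norm(mean) + 1e-8
    )
    # Softmax over similarities gives the aggregation weights.
    w = np.exp(sims - sims.max())
    w /= w.sum()
    return (w[:, None] * features).sum(axis=0)

rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 8)).astype(np.float32)
proto = affordance_prototype(feats)
print(proto.shape)  # (8,)
```

In this sketch the prototype is a weighted mean rather than a learned query; in a trained model the aggregation weights would typically come from learned attention over the shared affordance embedding.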