3D-TAFS: A Training-free Framework for 3D Affordance Segmentation

📅 2024-09-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the challenge of mapping natural-language instructions to 3D object affordance regions. We propose the first zero-shot 3D affordance segmentation framework. Methodologically, it integrates large vision-language models with dedicated 3D vision networks, leveraging point cloud–image cross-modal alignment and prompt-driven segmentation to directly localize semantic instructions onto 3D affordance regions. Our key contributions are: (1) the introduction of IndoorAfford-Bench, the first large-scale indoor interaction benchmark for affordance understanding; (2) a fine-tuning-free, semantics-driven 3D affordance segmentation paradigm; and (3) seamless 2D/3D multimodal fusion. Evaluated on IndoorAfford-Bench, our method significantly outperforms existing approaches, achieving superior accuracy, generalization, and zero-shot adaptability in complex indoor human–robot interaction scenarios.

Technology Category

Application Category

📝 Abstract
Translating high-level linguistic instructions into precise robotic actions in the physical world remains challenging, particularly when considering the feasibility of interacting with 3D objects. In this paper, we introduce 3D-TAFS, a novel training-free multimodal framework for 3D affordance segmentation. To facilitate a comprehensive evaluation of such frameworks, we present IndoorAfford-Bench, a large-scale benchmark containing 9,248 images spanning 20 diverse indoor scenes across 6 areas, supporting standardized interaction queries. In particular, our framework integrates a large multimodal model with a specialized 3D vision network, enabling a seamless fusion of 2D and 3D visual understanding with language comprehension. Extensive experiments on IndoorAfford-Bench validate the proposed 3D-TAFS's capability in handling interactive 3D affordance segmentation tasks across diverse settings, showcasing competitive performance across various metrics. Our results highlight 3D-TAFS's potential for enhancing human-robot interaction based on affordance understanding in complex indoor environments, advancing the development of more intuitive and efficient robotic frameworks for real-world applications.
Problem

Research questions and friction points this paper is trying to address.

Translating linguistic instructions into robotic actions
Segmenting 3D affordances without training
Enhancing human-robot interaction in indoor environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free multimodal framework for 3D segmentation
Integrates large multimodal model with 3D vision
Benchmark with 9,248 images across 20 scenes
🔎 Similar Papers
No similar papers found.
M
Meng Chu
Shanghai AI Lab
X
Xuan Zhang
School of Computing, National University of Singapore
Zhedong Zheng
Zhedong Zheng
University of Macau | NUS | UTS | Fudan
AIGCData-centric AISpatial IntelligenceObject Re-identificationDomain Adaptation
T
Tat-Seng Chua
School of Computing, National University of Singapore