🤖 AI Summary
This work addresses scene-level 3D affordance segmentation driven by natural-language instructions, a key challenge in embodied AI that requires joint semantic understanding and geometric reasoning. To overcome the limitations of existing methods in semantic inference, spatial localization, and geometric modeling, we propose TASA, a task-aware coarse-to-fine framework that unifies 2D semantic cues with 3D point cloud geometry: a task-aware 2D affordance detection module first selects salient views, and a 3D refinement module then fuses local geometric features with 2D semantic priors to produce efficient and spatially consistent 3D affordance predictions. Evaluated on the SceneFun3D benchmark, TASA achieves significant improvements over state-of-the-art baselines in both segmentation accuracy and computational efficiency, offering a more robust and interpretable route to semantic-geometric co-reasoning between embodied agents and complex 3D environments.
📝 Abstract
Understanding scene-level 3D affordances from natural language instructions is essential for enabling embodied agents to interact meaningfully with complex environments. The task remains challenging because it demands joint semantic reasoning and spatial grounding. Existing methods mainly focus on object-level affordances or merely lift 2D predictions into 3D, neglecting the rich geometric structure of point clouds and incurring high computational costs. To address these limitations, we introduce Task-Aware 3D Scene-level Affordance segmentation (TASA), a novel geometry-optimized framework that jointly leverages 2D semantic cues and 3D geometric reasoning in a coarse-to-fine manner. To improve affordance detection efficiency, TASA features a task-aware 2D affordance detection module that identifies manipulable points from language and visual inputs, guiding the selection of task-relevant views. To fully exploit 3D geometric information, a 3D affordance refinement module integrates the 2D semantic priors with local 3D geometry, yielding accurate and spatially coherent 3D affordance masks. Experiments on SceneFun3D demonstrate that TASA significantly outperforms baselines in both accuracy and efficiency for scene-level affordance segmentation.
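The coarse-to-fine idea above can be illustrated with a minimal sketch: a coarse stage that keeps only the views with the highest 2D affordance scores, and a fine stage that smooths lifted per-point scores over each point's spatial neighbourhood so the final mask respects local geometry. All function names, the brute-force neighbourhood search, and the toy data below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def select_task_relevant_views(view_scores, k=2):
    # Coarse stage (sketch): keep the k views whose task-aware 2D affordance
    # scores are highest; `view_scores` stands in for the 2D module's output.
    return np.argsort(view_scores)[::-1][:k]

def refine_with_local_geometry(points, coarse_scores, radius=0.2):
    # Fine stage (sketch): average each point's lifted 2D score over its
    # spatial neighbours so the 3D mask becomes locally coherent.
    refined = np.empty_like(coarse_scores)
    for i, p in enumerate(points):
        # brute-force radius search; a real system would use a k-d tree
        neighbours = np.linalg.norm(points - p, axis=1) <= radius
        refined[i] = coarse_scores[neighbours].mean()
    return refined

# toy scene: a tight cluster of three points plus one distant outlier
points = np.array([[0.0, 0, 0], [0.05, 0, 0], [0.1, 0, 0], [5.0, 0, 0]])
coarse = np.array([1.0, 1.0, 0.0, 1.0])  # noisy scores lifted from 2D views
refined = refine_with_local_geometry(points, coarse)
```

Within the cluster the noisy scores are averaged toward a common value, while the isolated point keeps its own score; this is the sense in which local 3D geometry regularizes the 2D semantic priors.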