CompassAD: Intent-Driven 3D Affordance Grounding in Functionally Competing Objects

📅 2026-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the problem of localizing the correct object and its manipulable regions in cluttered multi-object scenes under task-driven intent, where functionally similar objects (e.g., knives and scissors) create ambiguity. Existing approaches struggle to resolve such confusion without explicit category cues. To tackle this, we formalize a new task, 3D affordance grounding driven by implicit natural language intent, and propose CompassNet, a framework that combines Instance-bounded Cross Injection (ICI) and Bi-level Contrastive Refinement (BCR) to suppress cross-object semantic leakage and sharpen discrimination among confusable objects. We also introduce CompassAD, the first benchmark centered on implicit intent in confusable multi-object scenes. Experiments demonstrate state-of-the-art performance on both seen and unseen queries, and deployment on a robotic manipulator confirms transfer to real-world grasping.
📝 Abstract
When told to "cut the apple," a robot must choose the knife over nearby scissors, despite both objects affording the same cutting function. In real-world scenes, multiple objects may share identical affordances, yet only one is appropriate under the given task context. We call such cases confusing pairs. However, existing 3D affordance methods largely sidestep this challenge by evaluating isolated single objects, often with explicit category names provided in the query. We formalize Multi-Object Affordance Grounding under Intent-Driven Instructions, a new 3D affordance setting that requires predicting a per-point affordance mask on the correct object within a cluttered multi-object point cloud, conditioned on implicit natural language intent. To study this problem, we construct CompassAD, the first benchmark centered on implicit intent in confusable multi-object scenes. It comprises 30 confusing object pairs spanning 16 affordance types, 6,422 scenes, and 88K+ query-answer pairs. Furthermore, we propose CompassNet, a framework that incorporates two dedicated modules tailored to this task. Instance-bounded Cross Injection (ICI) constrains language-geometry alignment within object boundaries to prevent cross-object semantic leakage. Bi-level Contrastive Refinement (BCR) enforces discrimination at both geometric-group and point levels, sharpening distinctions between target and confusable surfaces. Extensive experiments demonstrate state-of-the-art results on both seen and unseen queries, and deployment on a robotic manipulator confirms effective transfer to real-world grasping in confusing multi-object scenes.
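As a rough illustration of the idea behind instance-bounded alignment, the sketch below normalizes language-to-point similarity separately inside each object instance, so that scores on one object cannot be influenced by points on a neighboring object. This is not the authors' ICI module: the function name `instance_bounded_attention`, the dot-product similarity, and the per-instance softmax are all assumptions chosen for a minimal, dependency-free example.

```python
import math
from collections import defaultdict


def instance_bounded_attention(point_feats, instance_ids, text_feat):
    """Per-instance softmax over language-point similarity.

    Hypothetical helper for illustration only; it is not the paper's ICI.
    point_feats: list of per-point feature vectors (lists of floats)
    instance_ids: list of instance labels, one per point
    text_feat: a single text embedding (list of floats)
    """
    # Dot-product similarity between each point feature and the text embedding.
    scores = [sum(p * t for p, t in zip(feat, text_feat)) for feat in point_feats]

    # Group point indices by the object instance they belong to.
    groups = defaultdict(list)
    for idx, inst in enumerate(instance_ids):
        groups[inst].append(idx)

    weights = [0.0] * len(point_feats)
    for idxs in groups.values():
        # Numerically stable softmax restricted to the points of this instance.
        peak = max(scores[i] for i in idxs)
        exps = {i: math.exp(scores[i] - peak) for i in idxs}
        total = sum(exps.values())
        for i in idxs:
            weights[i] = exps[i] / total
    return weights


# Toy scene: two points on object 0 and two points on object 1.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
ids = [0, 0, 1, 1]
weights = instance_bounded_attention(feats, ids, text_feat=[1.0, 0.0])
```

Because normalization never crosses an instance boundary, the weights within each object sum to 1, and changing the features of one object leaves the weights on the other unchanged. A full model would still need a separate step to decide which instance is the target.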
Problem

Research questions and friction points this paper is trying to address.

3D affordance grounding
intent-driven instructions
confusing object pairs
multi-object scenes
functional competition
Innovation

Methods, ideas, or system contributions that make the work stand out.

intent-driven affordance
multi-object grounding
3D point cloud
confusing object pairs
language-conditioned robotics
Authors

Jingliang Li
MARS Lab, Nanyang Technological University

Jindou Jia
MARS Lab, Nanyang Technological University

Tuo An
MARS Lab, Nanyang Technological University

Chuhao Zhou
Nanyang Technological University
Multimodal AI, Robotic Perception

Xiangyu Chen
MARS Lab, Nanyang Technological University

Shilin Shan
MARS Lab, Nanyang Technological University

Boyu Ma
MARS Lab, Nanyang Technological University

Bofan Lyu
MARS Lab, Nanyang Technological University

Gen Li
Postdoctoral Research Fellow, Nanyang Technological University
Embodied AI, Computer Vision, Robotics, Artificial Intelligence

Jianfei Yang
Assistant Professor, Director of MARS Lab, Nanyang Technological University
Physical AI, Embodied AI, Multimodal AI