🤖 AI Summary
Existing 3D referring localization methods model only pairwise object relationships, so they fail to capture the complex multi-object spatial relations commonly expressed in natural language, which leads to insufficient cross-modal alignment. To address this, we propose an *N-ary relation-aware framework* that progressively learns scene relationships from binary to N-ary configurations and builds a multimodal scene graph enriched with global semantics. Our method combines a hybrid attention mechanism with a group-wise supervision loss to explicitly model joint spatial and semantic constraints among object groups. Evaluated on the ReferIt3D and ScanRefer benchmarks, our approach outperforms state-of-the-art methods, indicating that explicit N-ary relational modeling improves comprehension of intricate spatial configurations and the precision of 3D referring localization.
📝 Abstract
Localizing 3D objects using natural language is essential for robotic scene understanding. The descriptions often involve multiple spatial relationships to distinguish similar objects, which makes 3D-language alignment difficult. Current methods model relationships only between pairs of objects, ignoring the global perceptual significance of n-ary combinations in multi-modal relational understanding. To address this, we propose a novel progressive relational learning framework for 3D object grounding. We extend relational learning from binary to n-ary relations to identify the visual relations that match the referential description globally. Because the training data contain no specific annotations for the referred objects, we design a grouped supervision loss to facilitate n-ary relational learning. In the scene graph built from these n-ary relationships, we use a multi-modal network with hybrid attention mechanisms to further localize the target within the n-ary combinations. Experiments and ablation studies on the ReferIt3D and ScanRefer benchmarks demonstrate that our method outperforms the state of the art and confirm the advantages of n-ary relational perception in 3D localization.
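
To make the idea of grouped supervision over n-ary combinations concrete, here is a minimal sketch. It is not the paper's implementation: the abstract does not give the loss formula, so the weak-labeling rule (a group counts as positive if it contains the annotated target) and the function names `enumerate_groups`, `group_labels`, and `grouped_bce_loss` are assumptions made only for illustration.

```python
import itertools
import math


def enumerate_groups(num_objects, n):
    """Return all n-ary combinations of object indices (binary when n == 2)."""
    return list(itertools.combinations(range(num_objects), n))


def group_labels(groups, target_idx):
    """Assumed weak label: 1.0 if the group contains the annotated target.

    Only the target is annotated, so the other objects mentioned in the
    description cannot be labeled individually.
    """
    return [1.0 if target_idx in group else 0.0 for group in groups]


def grouped_bce_loss(scores, labels):
    """Mean binary cross-entropy over per-group logits (hypothetical loss)."""
    total = 0.0
    for score, label in zip(scores, labels):
        prob = 1.0 / (1.0 + math.exp(-score))
        prob = min(max(prob, 1e-7), 1.0 - 1e-7)  # guard against log(0)
        total -= label * math.log(prob) + (1.0 - label) * math.log(1.0 - prob)
    return total / len(scores)


# Progressive schedule: start with pairs, then move to triples.
num_objects, target_idx = 4, 3
for n in (2, 3):
    groups = enumerate_groups(num_objects, n)
    labels = group_labels(groups, target_idx)
    scores = [0.0] * len(groups)  # placeholder logits from a relation scorer
    print(n, len(groups), grouped_bce_loss(scores, labels))
```

In a real system the placeholder logits would come from a learned relation scorer over 3D object and language features, and the number of combinations grows quickly with n, so some form of candidate pruning would likely be needed.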