🤖 AI Summary
This paper targets two challenges in 3D visual grounding: ambiguity between target and anchor objects in multi-anchor queries, and spatial descriptions that become inconsistent across viewpoints. It proposes a structured multi-view decomposition framework that explicitly models target-anchor relationships via a relation disentanglement module, designs a multi-view text-scene interaction mechanism, and introduces shared cross-modal view tokens to align textual descriptions with 3D scenes at a fine-grained level. Its key contribution is decomposing spatial reasoning into two orthogonal subtasks, relation modeling and viewpoint harmonization, which improves cross-view semantic stability. Experiments on benchmark datasets including ScanRefer and SR3D show substantial improvements over state-of-the-art methods, with marked gains in localization accuracy on queries involving multiple visually similar anchors or complex spatial relations.
📝 Abstract
3D visual grounding aims to identify and localize objects in 3D space based on textual descriptions. However, existing methods struggle to disentangle targets from anchors in complex multi-anchor queries and to resolve inconsistencies in spatial descriptions caused by perspective variations. To tackle these challenges, we propose ViewSRD, a framework that formulates 3D visual grounding as a structured multi-view decomposition process. First, the Simple Relation Decoupling (SRD) module restructures each complex multi-anchor query into a set of targeted single-anchor statements, yielding perspective-aware descriptions that clarify positional relationships. These decomposed representations serve as the foundation for the Multi-view Textual-Scene Interaction (Multi-TSI) module, which integrates textual and scene features across multiple viewpoints using shared Cross-modal Consistent View Tokens (CCVTs) to preserve spatial correlations. Finally, a Textual-Scene Reasoning module synthesizes the multi-view predictions into a unified and robust 3D visual grounding result. Experiments on 3D visual grounding datasets show that ViewSRD significantly outperforms state-of-the-art methods, particularly on complex queries requiring precise spatial differentiation.
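The two problems named in the abstract can be illustrated with a small toy sketch. It is not the ViewSRD implementation: the `SingleAnchorStatement` type, the hand-written decomposition, the `is_left_of` test, and the score averaging in `aggregate_views` are all hypothetical stand-ins chosen for illustration. In particular, the paper's SRD, Multi-TSI, and Textual-Scene Reasoning modules are learned components whose internals are not described here. The sketch only shows (a) what a multi-anchor query looks like once split into single-anchor statements, (b) why a relation such as "left of" depends on the viewpoint, and (c) one simple way per-view predictions could be merged.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SingleAnchorStatement:
    """Hypothetical container for one target-relation-anchor triple."""
    target: str
    relation: str
    anchor: str


# Hand-written (assumed) decomposition of the multi-anchor query
# "the chair left of the table and near the window".
statements = [
    SingleAnchorStatement("chair", "left of", "table"),
    SingleAnchorStatement("chair", "near", "window"),
]


def rotate_xy(point, angle_deg):
    """Rotate an (x, y) point about the z-axis by angle_deg degrees."""
    a = math.radians(angle_deg)
    x, y = point
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a))


def is_left_of(target_xy, anchor_xy, view_deg):
    """Toy test: target has a smaller x than anchor in the rotated frame."""
    tx, _ = rotate_xy(target_xy, view_deg)
    ax, _ = rotate_xy(anchor_xy, view_deg)
    return tx < ax - 1e-9


def aggregate_views(per_view_scores):
    """Average candidate scores over views (an assumed fusion rule).

    Returns the index of the best candidate and the mean scores.
    """
    num_views = len(per_view_scores)
    num_candidates = len(per_view_scores[0])
    mean = [sum(view[i] for view in per_view_scores) / num_views
            for i in range(num_candidates)]
    best = max(range(num_candidates), key=mean.__getitem__)
    return best, mean


if __name__ == "__main__":
    chair, table = (-1.0, 0.0), (0.0, 0.0)
    # The same pair of objects satisfies "left of" from one view
    # and violates it from the opposite view.
    for view in (0, 180):
        print(view, is_left_of(chair, table, view))

    # Made-up scores for three candidate objects under four views.
    scores = [[0.2, 0.7, 0.1], [0.6, 0.3, 0.1],
              [0.1, 0.8, 0.1], [0.3, 0.6, 0.1]]
    print(aggregate_views(scores)[0])
```

In this toy setup a single view (the second one) prefers the wrong candidate, while the average over all four views selects candidate 1. This mirrors, in a very simplified form, the motivation for combining predictions across viewpoints rather than trusting any single one.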