ViewSRD: 3D Visual Grounding via Structured Multi-View Decomposition

📅 2025-07-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper targets two difficulties in 3D visual grounding: ambiguity between the target and its reference (anchor) objects in multi-anchor queries, and spatial descriptions that change with the viewpoint. It proposes a structured multi-view decomposition framework that explicitly models target-anchor relationships through a relation decoupling module, adds a multi-view text-scene interaction mechanism, and introduces shared cross-modal view tokens for fine-grained alignment between textual descriptions and 3D scenes. The key contribution is to decompose spatial reasoning into two orthogonal subtasks, relation modeling and viewpoint harmonization, which improves cross-view semantic stability. Experiments on benchmark datasets including ScanRefer and Sr3D report improvements over state-of-the-art methods, most notably on queries that involve multiple visually similar anchors or complex spatial relations.

📝 Abstract

3D visual grounding aims to identify and localize objects in a 3D space based on textual descriptions. However, existing methods struggle with disentangling targets from anchors in complex multi-anchor queries and resolving inconsistencies in spatial descriptions caused by perspective variations. To tackle these challenges, we propose ViewSRD, a framework that formulates 3D visual grounding as a structured multi-view decomposition process. First, the Simple Relation Decoupling (SRD) module restructures complex multi-anchor queries into a set of targeted single-anchor statements, generating a structured set of perspective-aware descriptions that clarify positional relationships. These decomposed representations serve as the foundation for the Multi-view Textual-Scene Interaction (Multi-TSI) module, which integrates textual and scene features across multiple viewpoints using shared, Cross-modal Consistent View Tokens (CCVTs) to preserve spatial correlations. Finally, a Textual-Scene Reasoning module synthesizes multi-view predictions into a unified and robust 3D visual grounding. Experiments on 3D visual grounding datasets show that ViewSRD significantly outperforms state-of-the-art methods, particularly in complex queries requiring precise spatial differentiation.

Problem

Research questions and friction points this paper is trying to address.

Disentangle targets from anchors in multi-anchor queries

Resolve spatial inconsistencies from perspective variations

Improve 3D object localization via textual descriptions
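The second friction point, spatial inconsistency caused by perspective variation, can be made concrete with a toy example. The sketch below is not taken from the paper: the object coordinates, the `is_left_of` convention (observer looking along +y, smaller x means "left"), and the four 90-degree viewpoints are all assumptions chosen for illustration. Rotating the scene about the vertical axis stands in for moving the observer.

```python
import math

def rotate_about_z(point, angle_rad):
    """Rotate an (x, y, z) point about the vertical z-axis."""
    x, y, z = point
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return (c * x - s * y, s * x + c * y, z)

def is_left_of(target, anchor):
    """Toy convention (an assumption): the observer looks along +y,
    so the object with the smaller x coordinate is 'on the left'."""
    return target[0] < anchor[0]

# Hypothetical object centers in scene coordinates.
chair = (-1.0, 1.0, 0.0)
table = (1.0, 3.0, 0.0)

# The same scene under four viewpoints spaced 90 degrees apart.
view_angles = [k * math.pi / 2 for k in range(4)]
left_per_view = [
    is_left_of(rotate_about_z(chair, a), rotate_about_z(table, a))
    for a in view_angles
]

# The statement "the chair is left of the table" holds in some views only.
print(left_per_view)
```

The same sentence is true from some viewpoints and false from others, which is why a single-view reading of a relational description can be unreliable.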

Innovation

Methods, ideas, or system contributions that make the work stand out.

Simple Relation Decoupling for multi-anchor queries

Multi-view Textual-Scene Interaction with CCVTs

Textual-Scene Reasoning for unified 3D grounding
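The last item combines per-view predictions into one grounding result. This page does not describe how the Textual-Scene Reasoning module does that, so the following is only a minimal sketch under a stated assumption: each view produces one score per candidate object, and the views are fused by averaging their softmax probabilities. The function names and the example scores are hypothetical, and the averaging rule is a stand-in, not the paper's module.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_views(per_view_logits):
    """Average per-view candidate probabilities (assumed fusion rule).

    per_view_logits: one list of candidate scores per view.
    Returns (index of best candidate, fused probabilities).
    """
    probs = [softmax(view) for view in per_view_logits]
    n_views = len(probs)
    n_candidates = len(probs[0])
    fused = [sum(p[i] for p in probs) / n_views for i in range(n_candidates)]
    best = max(range(n_candidates), key=lambda i: fused[i])
    return best, fused

# Hypothetical scores for 3 candidate objects under 4 views.
# One view (the second) prefers a different candidate than the rest.
per_view_logits = [
    [2.0, 1.0, 0.1],
    [0.5, 1.5, 0.2],
    [2.2, 0.9, 0.0],
    [1.8, 1.1, 0.3],
]
best, fused = fuse_views(per_view_logits)
print(best, [round(p, 3) for p in fused])
```

Averaging lets the views that agree outvote a single view that disagrees. A learned reasoning module could weight the views differently.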
