ViewSRD: 3D Visual Grounding via Structured Multi-View Decomposition

📅 2025-07-15
🤖 AI Summary
This paper addresses two challenges in 3D visual grounding: target-anchor ambiguity under multi-anchor queries, and inconsistent spatial descriptions across viewpoints. It proposes a structured multi-view decomposition framework that restructures complex queries into single-anchor statements via a relation decoupling module, designs a multi-view textual-scene interaction mechanism, and introduces shared cross-modal view tokens to enable fine-grained alignment between textual descriptions and 3D scenes. The key contribution is decomposing spatial reasoning into two orthogonal subtasks, relation modeling and viewpoint harmonization, which improves cross-view semantic stability. Experiments show substantial gains over state-of-the-art methods on 3D visual grounding benchmarks, especially for queries involving multiple visually similar anchors or complex spatial relations, where localization accuracy increases markedly.

📝 Abstract
3D visual grounding aims to identify and localize objects in a 3D space based on textual descriptions. However, existing methods struggle with disentangling targets from anchors in complex multi-anchor queries and resolving inconsistencies in spatial descriptions caused by perspective variations. To tackle these challenges, we propose ViewSRD, a framework that formulates 3D visual grounding as a structured multi-view decomposition process. First, the Simple Relation Decoupling (SRD) module restructures complex multi-anchor queries into a set of targeted single-anchor statements, generating a structured set of perspective-aware descriptions that clarify positional relationships. These decomposed representations serve as the foundation for the Multi-view Textual-Scene Interaction (Multi-TSI) module, which integrates textual and scene features across multiple viewpoints using shared Cross-modal Consistent View Tokens (CCVTs) to preserve spatial correlations. Finally, a Textual-Scene Reasoning module synthesizes multi-view predictions into a unified and robust 3D visual grounding result. Experiments on 3D visual grounding datasets show that ViewSRD significantly outperforms state-of-the-art methods, particularly on complex queries requiring precise spatial differentiation.
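The abstract's three-stage pipeline (decompose the query, score it per viewpoint, fuse the views) can be sketched as plain data flow. Everything below is an illustrative assumption, not the authors' implementation: the function names, the rule-based decomposition, and the dict-lookup stand-in for cross-modal attention are all hypothetical simplifications of the learned SRD, Multi-TSI, and reasoning modules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SingleAnchorStatement:
    """One decomposed description: target, one anchor, one spatial relation."""
    target: str
    anchor: str
    relation: str


def simple_relation_decoupling(target, anchor_relations):
    """SRD stage (sketch): split a multi-anchor query into targeted
    single-anchor statements, one per (anchor, relation) pair."""
    return [SingleAnchorStatement(target, a, r) for a, r in anchor_relations]


def multi_view_scores(statements, views):
    """Multi-TSI stage (sketch): score the statements under each viewpoint.
    Here a 'view' is just a dict mapping (target, anchor, relation) triples
    to a confidence, standing in for learned cross-modal interaction."""
    return [
        sum(view.get((s.target, s.anchor, s.relation), 0.0) for s in statements)
        for view in views
    ]


def textual_scene_reasoning(view_scores):
    """Final stage (sketch): harmonize per-view predictions into one
    robust score, here by simple averaging."""
    return sum(view_scores) / len(view_scores)


# Toy run: "the chair next to the table, left of the lamp" under two views.
statements = simple_relation_decoupling(
    "chair", [("table", "next to"), ("lamp", "left of")]
)
views = [
    {("chair", "table", "next to"): 0.9, ("chair", "lamp", "left of"): 0.8},
    {("chair", "table", "next to"): 0.7},  # the lamp relation is occluded here
]
fused = textual_scene_reasoning(multi_view_scores(statements, views))
```

The averaging step is where viewpoint inconsistency is absorbed: a relation that only holds from some viewpoints still contributes, but cannot dominate the fused prediction.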
Problem

Research questions and friction points this paper is trying to address.

Disentangle targets from anchors in multi-anchor queries
Resolve spatial inconsistencies from perspective variations
Improve 3D object localization via textual descriptions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Simple Relation Decoupling for multi-anchor queries
Multi-view Textual-Scene Interaction with CCVTs
Textual-Scene Reasoning for unified 3D grounding
Ronggang Huang
South China University of Technology
Haoxin Yang
South China University of Technology
Yan Cai
South China University of Technology
Xuemiao Xu
Guangdong Engineering Center for Large Model and GenAI Technology, State Key Laboratory of Subtropical Building and Urban Science, Ministry of Education Key Laboratory of Big Data and Intelligent Robot, Guangdong Provincial Key Lab of Computational Intelligence and Cyberspace Information
Huaidong Zhang
South China University of Technology
Computer Vision
Shengfeng He
Singapore Management University
Visual Computing, Generative Models, Computer Vision, Computational Photography, Computer Graphics