SeqVLM: Proposal-Guided Multi-View Sequences Reasoning via VLM for Zero-Shot 3D Visual Grounding

📅 2025-08-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Zero-shot 3D visual grounding (3DVG) is practically valuable as it eliminates the need for scene-specific training, yet existing methods suffer from inadequate spatial understanding due to single-view inference and often lose contextual and fine-grained information. This paper proposes a multi-view collaborative zero-shot 3DVG framework: the first to dynamically couple 3D instance proposals with multi-frame image sequences, introducing a proposal-guided multi-view projection strategy and a sequence-query adaptive scheduling mechanism. Leveraging 3D semantic segmentation to generate proposals, the approach integrates semantic filtering, cross-view projection, and cross-modal reasoning via vision-language models. On ScanRefer and Nr3D, the method achieves Acc@0.25 of 55.6% and 53.2%, surpassing prior zero-shot state-of-the-art methods by 4.0 and 5.2 percentage points, respectively, demonstrating significant improvements in spatial localization accuracy and cross-modal inference efficiency.

📝 Abstract
3D Visual Grounding (3DVG) aims to localize objects in 3D scenes using natural language descriptions. Although supervised methods achieve higher accuracy in constrained settings, zero-shot 3DVG holds greater promise for real-world applications since it eliminates scene-specific training requirements. However, existing zero-shot methods face spatially limited reasoning due to their reliance on single-view localization, as well as contextual omissions and detail degradation. To address these issues, we propose SeqVLM, a novel zero-shot 3DVG framework that leverages multi-view real-world scene images with spatial information for target object reasoning. Specifically, SeqVLM first generates 3D instance proposals via a 3D semantic segmentation network and refines them through semantic filtering, retaining only semantically relevant candidates. A proposal-guided multi-view projection strategy then projects these candidate proposals onto real scene image sequences, preserving spatial relationships and contextual details during the conversion from 3D point clouds to images. Furthermore, to mitigate VLM computational overload, we implement a dynamic scheduling mechanism that iteratively processes sequence-query prompts, leveraging the VLM's cross-modal reasoning capabilities to identify textually specified objects. Experiments on the ScanRefer and Nr3D benchmarks demonstrate state-of-the-art performance, achieving Acc@0.25 scores of 55.6% and 53.2%, surpassing previous zero-shot methods by 4.0 and 5.2 percentage points, respectively, advancing 3DVG toward greater generalization and real-world applicability. The code is available at https://github.com/JiawLin/SeqVLM.
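The two geometric steps the abstract walks through, semantic filtering of instance proposals and proposal-guided projection into the views that actually see each candidate, can be sketched as follows. This is a minimal illustration under an assumed pinhole-camera model; all function names and data layouts are hypothetical and not taken from the paper's released code.

```python
import numpy as np

def filter_proposals(proposals, query_nouns):
    """Semantic filtering: keep only proposals whose label appears in the query."""
    return [p for p in proposals if p["label"] in query_nouns]

def project_points(points, K, E):
    """Project Nx3 world points into one view (K: 3x3 intrinsics, E: 3x4 extrinsics).
    Returns pixel coordinates and a mask of points in front of the camera."""
    homo = np.hstack([points, np.ones((len(points), 1))])  # Nx4 homogeneous coords
    cam = (E @ homo.T).T                                   # world -> camera frame
    in_front = cam[:, 2] > 0                               # z > 0: visible side
    pix = (K @ cam.T).T
    pix = pix[:, :2] / pix[:, 2:3]                         # perspective divide
    return pix, in_front

def rank_views(points, views, top_k=2):
    """Proposal-guided view selection: score each view by how many proposal
    points land inside its image, and keep the top_k best views."""
    scores = []
    for idx, (K, E, width, height) in enumerate(views):
        pix, in_front = project_points(points, K, E)
        inside = (in_front
                  & (pix[:, 0] >= 0) & (pix[:, 0] < width)
                  & (pix[:, 1] >= 0) & (pix[:, 1] < height))
        scores.append((int(inside.sum()), idx))
    # stable sort by score, descending; ties keep the earlier frame
    return [i for _, i in sorted(scores, key=lambda t: -t[0])[:top_k]]
```

The selected view indices would then index into the scene's RGB frame sequence to build the image sequence handed to the VLM alongside the query.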
Problem

Research questions and friction points this paper is trying to address.

Zero-shot 3D object localization without scene-specific training
Overcoming spatially limited reasoning caused by single-view localization
Addressing contextual omissions in 3D visual grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposal-guided multi-view projection strategy
Dynamic scheduling mechanism for VLM efficiency
Semantic filtering retains relevant 3D candidates
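The dynamic scheduling idea above, bounding how much image content reaches the VLM per prompt, can be sketched as a greedy packer. This is an assumed simplification of the paper's mechanism, shown only to make the concept concrete; the function name and batching policy are illustrative, not the authors' algorithm.

```python
def schedule_prompts(candidates, max_images=8):
    """Greedily pack (proposal_id, image_list) pairs into prompt batches so
    that no single VLM prompt carries more than max_images images.
    A candidate larger than the budget still gets its own batch."""
    batches, current, count = [], [], 0
    for pid, images in candidates:
        # flush the current batch if adding this candidate would overflow it
        if current and count + len(images) > max_images:
            batches.append(current)
            current, count = [], 0
        current.append((pid, images))
        count += len(images)
    if current:
        batches.append(current)
    return batches
```

Each batch would then be sent to the VLM as one sequence-query prompt, iterating until a single proposal is identified as the referent.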
Jiawen Lin
School of Informatics, Xiamen University
Shiran Bian
School of Informatics, Xiamen University
Yihang Zhu
ShanghaiTech University
Embodied AI
Wenbin Tan
School of Informatics, Xiamen University
Yachao Zhang
Xiamen University, Tsinghua University
3D Computer Vision, Point Cloud Analysis, Understanding of 3D Scenes, Deep Learning
Yuan Xie
School of Computer Science and Technology, East China Normal University
Yanyun Qu
Xiamen University
Computer Vision