S$^2$-MLLM: Boosting Spatial Reasoning Capability of MLLMs for 3D Visual Grounding with Structural Guidance

📅 2025-11-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) suffer from weak spatial reasoning and low efficiency in 3D visual grounding (3DVG) because they rely on 2D inputs, typically augmented with viewpoint-dependent renderings of explicitly reconstructed point clouds. To address this, the authors propose S$^2$-MLLM, an implicit spatial reasoning framework that removes the need for explicit point cloud reconstruction at inference. It introduces a structure-aware feed-forward 3D reconstruction strategy as training-time spatial guidance, coupled with a Structure-Enhanced (SE) module that integrates intra-view and inter-view attention with multi-level position encoding, enabling end-to-end implicit 3D structural learning and spatial relation modeling. Evaluated on ScanRefer, Nr3D, and Sr3D, S$^2$-MLLM consistently outperforms state-of-the-art methods, improving grounding accuracy, cross-scene generalization, and computational efficiency at once. Notably, it is presented as the first approach to jointly deliver high accuracy and practical deployability for 3DVG.

📝 Abstract
3D Visual Grounding (3DVG) focuses on locating objects in 3D scenes based on natural language descriptions, serving as a fundamental task for embodied AI and robotics. Recent advances in Multi-modal Large Language Models (MLLMs) have motivated research into extending them to 3DVG. However, MLLMs primarily process 2D visual inputs and struggle to understand the 3D spatial structure of scenes solely from these limited perspectives. Existing methods mainly rely on viewpoint-dependent renderings of reconstructed point clouds to provide explicit structural guidance for MLLMs in 3DVG tasks, leading to inefficiency and limited spatial reasoning. To address this issue, we propose S$^2$-MLLM, an efficient framework that enhances spatial reasoning in MLLMs through implicit spatial reasoning. We introduce a spatial guidance strategy that leverages the structure awareness of feed-forward 3D reconstruction. By acquiring 3D structural understanding during training, our model can implicitly reason about 3D scenes without relying on inefficient point cloud reconstruction. Moreover, we propose a Structure-Enhanced (SE) module, which first employs intra-view and inter-view attention mechanisms to capture dependencies within views and correspondences across views. The module further integrates multi-level position encoding to associate visual representations with spatial positions and viewpoint information, enabling more accurate structural understanding. Extensive experiments demonstrate that S$^2$-MLLM unifies superior performance, generalization, and efficiency, achieving significant gains over existing methods on the ScanRefer, Nr3D, and Sr3D datasets. Code will be made available upon acceptance.
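
To make the SE module's design concrete, below is a minimal PyTorch sketch of how intra-view attention, inter-view attention, and multi-level position encoding might compose. The paper's code is not yet released, so every module name, dimension, and ordering here is an assumption, not the authors' implementation.

```python
# Hypothetical sketch of the Structure-Enhanced (SE) module described in the
# abstract: intra-view attention within each view, inter-view attention across
# views, and multi-level (patch-level + view-level) position encoding.
# All names and default sizes are assumptions; official code is unreleased.
import torch
import torch.nn as nn


class SEModule(nn.Module):
    def __init__(self, dim=1024, heads=8, num_views=8, patches_per_view=256):
        super().__init__()
        # Intra-view attention: dependencies among patches within one view.
        self.intra_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Inter-view attention: correspondences across views per patch slot.
        self.inter_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Multi-level position encoding: learned patch-level spatial positions
        # plus a view-level (viewpoint) embedding.
        self.patch_pos = nn.Parameter(torch.zeros(1, patches_per_view, dim))
        self.view_pos = nn.Parameter(torch.zeros(num_views, 1, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, views, patches, dim) multi-view visual tokens.
        b, v, p, d = x.shape
        x = x + self.patch_pos + self.view_pos.unsqueeze(0)  # broadcast add

        # Intra-view: attend within each view independently.
        h = x.reshape(b * v, p, d)
        q = self.norm1(h)
        h = h + self.intra_attn(q, q, q)[0]

        # Inter-view: for each patch index, attend across the v views.
        h = h.reshape(b, v, p, d).transpose(1, 2).reshape(b * p, v, d)
        q = self.norm2(h)
        h = h + self.inter_attn(q, q, q)[0]
        return h.reshape(b, p, v, d).transpose(1, 2)  # back to (b, v, p, d)
```

The intra-then-inter ordering follows the abstract's phrasing ("first employs intra-view and inter-view attention"); whether the positional embeddings are learned or analytic is not specified and is guessed here.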
Problem

Research questions and friction points this paper is trying to address.

Enhancing MLLMs' 3D spatial reasoning for visual grounding tasks
Overcoming inefficiency of explicit point cloud reconstruction methods
Improving structural understanding from limited 2D visual inputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Implicit spatial reasoning via feed-forward 3D reconstruction (a hedged training sketch follows this list)
Intra-view and inter-view attention for structural dependencies
Multi-level position encoding for spatial and viewpoint association
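
Since the code is unreleased, the following is a hedged PyTorch sketch of what training-time spatial guidance via feed-forward 3D reconstruction could look like: the MLLM's visual tokens are aligned to features from a frozen feed-forward reconstruction teacher, so no point cloud needs to be built at inference. The loss form, the projection head `proj`, and the `recon_model` interface are all hypothetical.

```python
# Hypothetical training-time spatial guidance loss. Assumes a frozen
# feed-forward 3D reconstruction network acts as a structure-aware teacher
# and emits one feature per visual token; both assumptions are ours.
import torch
import torch.nn.functional as F


def spatial_guidance_loss(mllm_feats, recon_model, images, proj):
    # mllm_feats: (batch, tokens, dim) visual tokens from the MLLM encoder.
    # recon_model: frozen feed-forward 3D reconstruction teacher.
    # proj: small head mapping MLLM tokens into the teacher's feature space.
    with torch.no_grad():
        target = recon_model(images)  # (batch, tokens, dim_3d), structure-aware
    pred = proj(mllm_feats)           # (batch, tokens, dim_3d)
    # Cosine alignment; an L2 or contrastive objective would be equally plausible.
    return 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()
```

At inference the teacher and this loss are dropped, which is consistent with the paper's claim that the model reasons about 3D structure implicitly, without point cloud reconstruction.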
👥 Authors
Beining Xu, Shanghai Jiao Tong University
Siting Zhu, Shanghai Jiao Tong University
Zhao Jin, Nanyang Technological University
Junxian Li, NSEC Lab, Shanghai Jiao Tong University (AI Security · Reasoning · Data Mining)
Hesheng Wang, Shanghai Jiao Tong University