Reasoning in Space via Grounding in the World

📅 2025-10-15
🤖 AI Summary
Current 3D large language models lack a unified 3D representation that jointly models semantic and geometric information. As a result, they either ground objects poorly or rely on external modules, which prevents end-to-end integration of grounding and spatial reasoning. To address this, the paper proposes GS-Reasoner, with three contributions: a unified image-patch-level 3D representation that tightly couples geometric, semantic, and positional features through a dual-path pooling mechanism; the first 3D LLM to perform autoregressive grounding without any external modules, generating grounding and spatial reasoning jointly; and the GCoT dataset, which explicitly embeds grounding in multi-step reasoning chains. GS-Reasoner performs comparably to state-of-the-art models on 3D visual grounding and reaches state-of-the-art results on spatial reasoning benchmarks, supporting the claim that a unified 3D representation is central to spatial reasoning.

📝 Abstract
In this paper, we claim that 3D visual grounding is the cornerstone of spatial reasoning and introduce the Grounded-Spatial Reasoner (GS-Reasoner) to explore effective spatial representations that bridge the gap between them. Existing 3D LLMs suffer from the absence of a unified 3D representation capable of jointly capturing semantic and geometric information. This deficiency is manifested either in poor performance on grounding or in an excessive reliance on external modules, ultimately hindering the seamless integration of grounding and spatial reasoning. To address this, we propose a simple yet effective dual-path pooling mechanism that tightly aligns geometric features with both semantic and positional cues, constructing a unified image patch-based 3D representation that encapsulates all essential information without increasing the number of input tokens. Leveraging this holistic representation, GS-Reasoner is the first 3D LLM that achieves autoregressive grounding entirely without external modules while delivering performance comparable to state-of-the-art models, establishing a unified and self-contained framework for 3D spatial reasoning. To further bridge grounding and spatial reasoning, we introduce the Grounded Chain-of-Thought (GCoT) dataset. This dataset is meticulously curated to include both 3D bounding box annotations for objects referenced in reasoning questions and step-by-step reasoning paths that integrate grounding as a core component of the problem-solving process. Extensive experiments demonstrate that GS-Reasoner achieves impressive results on 3D visual grounding, which in turn significantly enhances its spatial reasoning capabilities, leading to state-of-the-art performance.
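
The abstract gives no implementation details for dual-path pooling, so the following is only a rough illustration of the general idea: pool per-pixel geometric features and per-pixel 3D positions over each image patch, then fuse both with that patch's semantic feature so the token count stays equal to the patch count. The function names, the mean pooling, and the fusion by concatenation are all assumptions made for this sketch, not the authors' method.

```python
from statistics import fmean


def pool_patch(values, patch, row, col):
    """Average per-pixel vectors over one patch x patch window."""
    window = [
        values[r][c]
        for r in range(row * patch, (row + 1) * patch)
        for c in range(col * patch, (col + 1) * patch)
    ]
    dim = len(window[0])
    return [fmean(v[d] for v in window) for d in range(dim)]


def fuse_tokens(semantic, geometric, xyz, patch):
    """Build one fused token per image patch (hypothetical helper).

    semantic:  [rows][cols] per-patch semantic feature vectors
    geometric: [H][W] per-pixel geometric feature vectors
    xyz:       [H][W] per-pixel 3D coordinates
    """
    tokens = []
    for row, sem_row in enumerate(semantic):
        for col, sem in enumerate(sem_row):
            geo = pool_patch(geometric, patch, row, col)  # path 1: geometry
            pos = pool_patch(xyz, patch, row, col)        # path 2: position
            tokens.append(sem + geo + pos)  # assumed fusion: concatenation
    return tokens


# Toy input: a 4x4 image split into 2x2 patches.
semantic = [[[1.0] for _ in range(2)] for _ in range(2)]
geometric = [[[float(r)] for _ in range(4)] for r in range(4)]
xyz = [[[float(r), float(c), 1.0] for c in range(4)] for r in range(4)]

tokens = fuse_tokens(semantic, geometric, xyz, patch=2)
print(len(tokens))  # 4 patches in, 4 tokens out
```

The point of the toy example is the last line: geometry and position are folded into the existing patch tokens, so the sequence length seen by the language model does not grow.
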
Problem

Research questions and friction points this paper is trying to address.

Developing unified 3D representations for semantic and geometric information
Achieving autoregressive 3D grounding without external modules
Bridging 3D visual grounding with spatial reasoning capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-path pooling aligns geometry with semantics
Unified patch-based 3D representation eliminates external modules
Grounded Chain-of-Thought dataset integrates grounding with reasoning
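
To make the GCoT idea concrete, here is a minimal sketch of what a sample that pairs 3D boxes with reasoning steps might look like. The class names, field names, box format (center plus size), and the example scene are all hypothetical; the source only states that the dataset contains 3D bounding box annotations for referenced objects and step-by-step reasoning paths that include grounding.

```python
from dataclasses import dataclass, field
from math import dist


@dataclass
class GroundedObject:
    label: str
    center: tuple  # (x, y, z), assumed box format
    size: tuple    # (width, depth, height), assumed box format


@dataclass
class GCoTSample:
    question: str
    objects: list
    steps: list = field(default_factory=list)
    answer: str = ""


def closer_to(anchor, a, b):
    """Return whichever of a or b has its box center nearer to anchor."""
    if dist(anchor.center, a.center) <= dist(anchor.center, b.center):
        return a
    return b


# Invented example scene, for illustration only.
sofa = GroundedObject("sofa", (0.0, 0.0, 0.0), (2.0, 1.0, 1.0))
table = GroundedObject("table", (1.0, 0.0, 0.0), (1.0, 1.0, 0.5))
lamp = GroundedObject("lamp", (3.0, 4.0, 0.0), (0.3, 0.3, 1.5))

sample = GCoTSample(
    question="Which is closer to the sofa, the table or the lamp?",
    objects=[sofa, table, lamp],
)
# Step 1 grounds the referenced objects; step 2 reasons over their boxes.
sample.steps.append("Ground sofa, table, and lamp as 3D boxes.")
winner = closer_to(sofa, table, lamp)
sample.steps.append(f"Compare center distances; {winner.label} is nearer.")
sample.answer = winner.label
print(sample.answer)
```

The answer here is derived from the grounded boxes rather than stated directly, which is the sense in which grounding acts as a step inside the reasoning chain.
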