SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning

📅 2026-03-28
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing large vision-language models exhibit limited performance in 3D spatial reasoning, struggling to effectively model fine-grained geometric structures and spatial relationships. To address this limitation, this work proposes SpatialStack, a novel framework that simultaneously fuses visual, geometric, and linguistic representations across multiple levels, moving beyond conventional approaches that rely solely on deep-layer feature fusion. By integrating a multi-view geometric Transformer with a language backbone, the resulting VLM-SpatialStack model achieves state-of-the-art performance on multiple 3D spatial reasoning benchmarks. This approach effectively balances local geometric precision with global semantic context, significantly enhancing the model’s robustness and generalization capabilities.
πŸ“ Abstract
Large vision-language models (VLMs) still struggle with reliable 3D spatial reasoning, a core capability for embodied and physical AI systems. This limitation arises from their inability to capture fine-grained 3D geometry and spatial relationships. While recent efforts have introduced multi-view geometry transformers into VLMs, they typically fuse only the deep-layer features from vision and geometry encoders, discarding rich hierarchical signals and creating a fundamental bottleneck for spatial understanding. To overcome this, we propose SpatialStack, a general hierarchical fusion framework that progressively aligns vision, geometry, and language representations across the model hierarchy. Moving beyond conventional late-stage vision-geometry fusion, SpatialStack stacks and synchronizes multi-level geometric features with the language backbone, enabling the model to capture both local geometric precision and global contextual semantics. Building upon this framework, we develop VLM-SpatialStack, a model that achieves state-of-the-art performance on multiple 3D spatial reasoning benchmarks. Extensive experiments and ablations demonstrate that our multi-level fusion strategy consistently enhances 3D understanding and generalizes robustly across diverse spatial reasoning tasks, establishing SpatialStack as an effective and extensible design paradigm for vision-language-geometry integration in next-generation multimodal physical AI systems.
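The abstract's core idea, injecting geometry features at every layer of the language backbone rather than only after the final encoder layer, can be illustrated with a toy sketch. This is not the paper's code: the module names (`LevelFusion`, `SpatialStackToy`), layer counts, and dimensions are all illustrative assumptions, using a cross-attention injection as one plausible reading of "stacks and synchronizes multi-level geometric features".

```python
# Toy sketch of hierarchical geometry-language fusion (illustrative only;
# not the SpatialStack implementation).
import torch
import torch.nn as nn

class LevelFusion(nn.Module):
    """Cross-attend language tokens to geometry features at one level."""
    def __init__(self, d_model, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, lang, geom):
        fused, _ = self.attn(query=lang, key=geom, value=geom)
        return self.norm(lang + fused)  # residual keeps language semantics

class SpatialStackToy(nn.Module):
    """Fuse geometry features at *every* language layer, not just the last."""
    def __init__(self, d_model=64, n_layers=3):
        super().__init__()
        # One geometry "encoder level" per language layer (hypothetical).
        self.geom_levels = nn.ModuleList(
            nn.Linear(d_model, d_model) for _ in range(n_layers))
        self.lang_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_layers))
        self.fusers = nn.ModuleList(
            LevelFusion(d_model) for _ in range(n_layers))

    def forward(self, lang_tokens, geom_tokens):
        g = geom_tokens
        x = lang_tokens
        for geom_level, lang_layer, fuse in zip(
                self.geom_levels, self.lang_layers, self.fusers):
            g = torch.relu(geom_level(g))   # next geometry feature level
            x = fuse(lang_layer(x), g)      # inject geometry at this depth
        return x

model = SpatialStackToy()
lang = torch.randn(2, 10, 64)   # (batch, text tokens, dim)
geom = torch.randn(2, 32, 64)   # (batch, multi-view geometry tokens, dim)
out = model(lang, geom)
print(out.shape)  # torch.Size([2, 10, 64])
```

The contrast with conventional late-stage fusion is that the loop touches geometry at every depth, so shallow layers can attend to fine local structure while deep layers attend to level-appropriate abstractions.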
Problem

Research questions and friction points this paper is trying to address.

3D spatial reasoning
vision-language models
geometry-language fusion
spatial understanding
multimodal AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

hierarchical fusion
3D spatial reasoning
geometry-language alignment
vision-language models
multi-level feature integration