SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning

📅 2026-03-28
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing large vision-language models exhibit limited performance in 3D spatial reasoning, struggling to effectively model fine-grained geometric structures and spatial relationships. To address this limitation, this work proposes SpatialStack, a novel framework that simultaneously fuses visual, geometric, and linguistic representations across multiple levels, moving beyond conventional approaches that rely solely on deep-layer feature fusion. By integrating a multi-view geometric Transformer with a language backbone, the resulting VLM-SpatialStack model achieves state-of-the-art performance on multiple 3D spatial reasoning benchmarks. This approach effectively balances local geometric precision with global semantic context, significantly enhancing the model’s robustness and generalization capabilities.
πŸ“ Abstract
Large vision-language models (VLMs) still struggle with reliable 3D spatial reasoning, a core capability for embodied and physical AI systems. This limitation arises from their inability to capture fine-grained 3D geometry and spatial relationships. While recent efforts have introduced multi-view geometry transformers into VLMs, they typically fuse only the deep-layer features from vision and geometry encoders, discarding rich hierarchical signals and creating a fundamental bottleneck for spatial understanding. To overcome this, we propose SpatialStack, a general hierarchical fusion framework that progressively aligns vision, geometry, and language representations across the model hierarchy. Moving beyond conventional late-stage vision-geometry fusion, SpatialStack stacks and synchronizes multi-level geometric features with the language backbone, enabling the model to capture both local geometric precision and global contextual semantics. Building upon this framework, we develop VLM-SpatialStack, a model that achieves state-of-the-art performance on multiple 3D spatial reasoning benchmarks. Extensive experiments and ablations demonstrate that our multi-level fusion strategy consistently enhances 3D understanding and generalizes robustly across diverse spatial reasoning tasks, establishing SpatialStack as an effective and extensible design paradigm for vision-language-geometry integration in next-generation multimodal physical AI systems.
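The abstract's core idea, injecting geometry features at every layer of the language backbone rather than only after the final encoder layer, can be illustrated with a toy sketch. This is not the paper's code: the module names (`LevelFusion`, `SpatialStackToy`), layer counts, and dimensions are all illustrative assumptions, using a cross-attention injection as one plausible reading of "stacks and synchronizes multi-level geometric features".

```python
# Toy sketch of hierarchical geometry-language fusion (illustrative only;
# not the SpatialStack implementation).
import torch
import torch.nn as nn

class LevelFusion(nn.Module):
    """Cross-attend language tokens to geometry features at one level."""
    def __init__(self, d_model, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, lang, geom):
        fused, _ = self.attn(query=lang, key=geom, value=geom)
        return self.norm(lang + fused)  # residual keeps language semantics

class SpatialStackToy(nn.Module):
    """Fuse geometry features at *every* language layer, not just the last."""
    def __init__(self, d_model=64, n_layers=3):
        super().__init__()
        # One geometry "encoder level" per language layer (hypothetical).
        self.geom_levels = nn.ModuleList(
            nn.Linear(d_model, d_model) for _ in range(n_layers))
        self.lang_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_layers))
        self.fusers = nn.ModuleList(
            LevelFusion(d_model) for _ in range(n_layers))

    def forward(self, lang_tokens, geom_tokens):
        g = geom_tokens
        x = lang_tokens
        for geom_level, lang_layer, fuse in zip(
                self.geom_levels, self.lang_layers, self.fusers):
            g = torch.relu(geom_level(g))   # next geometry feature level
            x = fuse(lang_layer(x), g)      # inject geometry at this depth
        return x

model = SpatialStackToy()
lang = torch.randn(2, 10, 64)   # (batch, text tokens, dim)
geom = torch.randn(2, 32, 64)   # (batch, multi-view geometry tokens, dim)
out = model(lang, geom)
print(out.shape)  # torch.Size([2, 10, 64])
```

The contrast with conventional late-stage fusion is that the loop touches geometry at every depth, so shallow layers can attend to fine local structure while deep layers attend to level-appropriate abstractions.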
Problem

Research questions and friction points this paper is trying to address.

3D spatial reasoning
vision-language models
geometry-language fusion
spatial understanding
multimodal AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

hierarchical fusion
3D spatial reasoning
geometry-language alignment
vision-language models
multi-level feature integration