SSR: Pushing the Limit of Spatial Intelligence with Structured Scene Reasoning

📅 2026-02-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limitations of multimodal large language models in spatial perception and geometric reasoning, which suffer from high modality alignment costs and constrained structural modeling accuracy. The authors propose a lightweight 2D/3D fusion mechanism that anchors 3D geometric features into a pre-aligned 2D semantic space through cross-modal addition and token interleaving. They further introduce a local triplet scene graph based on relative coordinates to enable efficient spatial reasoning. Notably, the approach avoids large-scale alignment pretraining by adopting an incremental, language-model-friendly strategy for structured scene graph generation, achieving, for the first time, globally consistent metric-accurate 3D localization across heterogeneous data sources. Evaluated on benchmarks such as VSI-Bench, the method attains state-of-the-art performance (73.9 points) with only a 7B-parameter model, significantly outperforming larger counterparts.
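The two fusion operations named in the summary, cross-modal addition and token interleaving, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `fuse_by_addition` and `interleave` are hypothetical, and the sketch assumes the 3D geometric features have already been projected to the same embedding width as the 2D semantic tokens and are paired one-to-one with them.

```python
from typing import List

Token = List[float]  # one embedding vector; the width is an assumption


def fuse_by_addition(sem_2d: List[Token], geo_3d: List[Token]) -> List[Token]:
    """Add each 3D geometry token element-wise to its paired 2D semantic token."""
    if len(sem_2d) != len(geo_3d):
        raise ValueError("2D and 3D token counts differ")
    fused = []
    for s, g in zip(sem_2d, geo_3d):
        if len(s) != len(g):
            raise ValueError("embedding widths differ")
        fused.append([a + b for a, b in zip(s, g)])
    return fused


def interleave(sem_2d: List[Token], geo_3d: List[Token]) -> List[Token]:
    """Alternate the two streams: s0, g0, s1, g1, ..."""
    if len(sem_2d) != len(geo_3d):
        raise ValueError("2D and 3D token counts differ")
    out = []
    for s, g in zip(sem_2d, geo_3d):
        out.append(s)
        out.append(g)
    return out


if __name__ == "__main__":
    sem = [[1.0, 2.0], [3.0, 4.0]]    # toy 2D semantic tokens
    geo = [[0.5, 0.5], [1.0, -1.0]]   # toy 3D geometry tokens
    print(fuse_by_addition(sem, geo))
    print(interleave(sem, geo))
```

Addition keeps the sequence length unchanged, so the language model sees the same number of visual tokens as before, while interleaving doubles the length but keeps the two modalities separable. How SSR combines or chooses between them is not stated in this summary.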

📝 Abstract

While Multimodal Large Language Models (MLLMs) excel in semantic tasks, they frequently lack the "spatial sense" essential for sophisticated geometric reasoning. Current models typically suffer from exorbitant modality-alignment costs and a lack of precision in fine-grained structural modeling. We introduce SSR, a framework designed for Structured Scene Reasoning that seamlessly integrates 2D and 3D representations via a lightweight alignment mechanism. To minimize training overhead, our framework anchors 3D geometric features to the large language model's pre-aligned 2D visual semantics through cross-modal addition and token interleaving, effectively obviating the necessity for large-scale alignment pre-training. To underpin complex spatial reasoning, we propose a novel scene graph generation pipeline that represents global layouts as a chain of independent local triplets defined by relative coordinates. This is complemented by an incremental generation algorithm, enabling the model to construct "language-model-friendly" structural scaffolds for complex environments. Furthermore, we extend these capabilities to the global-scale 3D grounding task, achieving absolute metric precision across heterogeneous data sources. At a 7B parameter scale, SSR achieves state-of-the-art performance on multiple spatial intelligence benchmarks, notably scoring 73.9 on VSI-Bench. Our approach significantly outperforms much larger models, demonstrating that efficient feature alignment and structured scene reasoning are the cornerstones of authentic spatial intelligence.
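To make the idea of a layout expressed as a chain of local triplets with relative coordinates more concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the paper's pipeline: the `Triplet` fields, the `resolve_layout` helper, the text serialization, and the choice of metres as the unit are all hypothetical. Each triplet stores only the offset of a new object from an object that is already placed, and global positions are recovered incrementally by accumulating offsets.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Triplet:
    anchor: str   # object that has already been placed
    target: str   # object being added to the scene
    offset: Vec3  # target minus anchor position (metres, an assumption)


def resolve_layout(root: str, chain: List[Triplet]) -> Dict[str, Vec3]:
    """Place objects one triplet at a time, starting with the root at the origin."""
    positions: Dict[str, Vec3] = {root: (0.0, 0.0, 0.0)}
    for t in chain:
        if t.anchor not in positions:
            raise KeyError(f"anchor {t.anchor!r} has not been placed yet")
        ax, ay, az = positions[t.anchor]
        dx, dy, dz = t.offset
        positions[t.target] = (ax + dx, ay + dy, az + dz)
    return positions


def to_text(t: Triplet) -> str:
    """Serialize one triplet as a short string a language model could emit."""
    dx, dy, dz = t.offset
    return f"<{t.anchor}, ({dx:.2f}, {dy:.2f}, {dz:.2f}), {t.target}>"


if __name__ == "__main__":
    chain = [
        Triplet("sofa", "table", (1.0, 0.0, 0.0)),
        Triplet("table", "lamp", (0.5, 0.25, 0.5)),
    ]
    for t in chain:
        print(to_text(t))
    print(resolve_layout("sofa", chain))
```

Because every triplet refers only to a local offset, each generation step needs to state just one relation, which is one plausible reading of why such a representation would be "language-model-friendly".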

Problem

Research questions and friction points this paper is trying to address.

spatial intelligence

geometric reasoning

modality alignment

structured scene reasoning

3D grounding

Innovation

Methods, ideas, or system contributions that make the work stand out.

Structured Scene Reasoning

Cross-modal Alignment

Scene Graph Generation