SSR: Pushing the Limit of Spatial Intelligence with Structured Scene Reasoning

📅 2026-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two limitations of multimodal large language models in spatial perception and geometric reasoning: high modality-alignment cost and limited precision in structural modeling. The authors propose a lightweight 2D/3D fusion mechanism that anchors 3D geometric features in the language model's pre-aligned 2D semantic space through cross-modal addition and token interleaving, which avoids large-scale alignment pretraining. For spatial reasoning, they represent the global layout as a chain of local scene-graph triplets defined by relative coordinates and build it with an incremental, language-model-friendly generation algorithm. They also extend the approach to global 3D grounding, reporting absolute metric precision across heterogeneous data sources. On benchmarks such as VSI-Bench, the 7B-parameter model reaches state-of-the-art performance (73.9 points) and outperforms much larger models.
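The two fusion operations named in the summary can be illustrated on toy data. This is a minimal sketch only: the summary does not say how SSR combines addition and interleaving, what the feature dimensions are, or how tokens are paired, so the function names `add_fuse` and `interleave` and the one-to-one token pairing are assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def add_fuse(tokens_2d: List[Vector], tokens_3d: List[Vector]) -> List[Vector]:
    """Cross-modal addition: add each 3D feature vector onto its paired 2D vector.

    Assumption: both sequences have the same token count and feature width.
    """
    if len(tokens_2d) != len(tokens_3d):
        raise ValueError("token counts must match")
    fused = []
    for t2, t3 in zip(tokens_2d, tokens_3d):
        if len(t2) != len(t3):
            raise ValueError("feature dimensions must match")
        fused.append([a + b for a, b in zip(t2, t3)])
    return fused


def interleave(tokens_a: List[Vector], tokens_b: List[Vector]) -> List[Vector]:
    """Token interleaving: emit a0, b0, a1, b1, ... from two equal-length sequences."""
    if len(tokens_a) != len(tokens_b):
        raise ValueError("token counts must match")
    out: List[Vector] = []
    for a, b in zip(tokens_a, tokens_b):
        out.append(a)
        out.append(b)
    return out
```

Addition keeps the sequence length unchanged, while interleaving doubles it; which trade-off the paper makes at each stage is not stated here.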

📝 Abstract
While Multimodal Large Language Models (MLLMs) excel in semantic tasks, they frequently lack the "spatial sense" essential for sophisticated geometric reasoning. Current models typically suffer from exorbitant modality-alignment costs and from limited precision in fine-grained structural modeling. We introduce SSR, a framework designed for Structured Scene Reasoning that seamlessly integrates 2D and 3D representations via a lightweight alignment mechanism. To minimize training overhead, our framework anchors 3D geometric features to the large language model's pre-aligned 2D visual semantics through cross-modal addition and token interleaving, effectively obviating the necessity for large-scale alignment pre-training. To underpin complex spatial reasoning, we propose a novel scene graph generation pipeline that represents global layouts as a chain of independent local triplets defined by relative coordinates. This is complemented by an incremental generation algorithm, enabling the model to construct "language-model-friendly" structural scaffolds for complex environments. Furthermore, we extend these capabilities to the task of global-scale 3D grounding, achieving absolute metric precision across heterogeneous data sources. At a 7B parameter scale, SSR achieves state-of-the-art performance on multiple spatial intelligence benchmarks, notably scoring 73.9 on VSI-Bench. Our approach significantly outperforms much larger models, demonstrating that efficient feature alignment and structured scene reasoning are the cornerstones of authentic spatial intelligence.
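To make the idea of a global layout expressed as local triplets with relative coordinates concrete, here is a small sketch. The triplet format `(subject, anchor, offset)`, the function name `resolve_layout`, and the choice of a root object at the origin are all assumptions for illustration; the abstract does not give SSR's actual triplet schema or generation algorithm.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]
# Hypothetical triplet: (subject, anchor, offset of subject relative to anchor)
Triplet = Tuple[str, str, Point]


def resolve_layout(root: str, triplets: List[Triplet]) -> Dict[str, Point]:
    """Place objects incrementally: each subject sits at anchor position + offset.

    The root object is placed at the origin (an assumption of this sketch).
    """
    positions: Dict[str, Point] = {root: (0.0, 0.0, 0.0)}
    pending = list(triplets)
    while pending:
        remaining: List[Triplet] = []
        for subject, anchor, offset in pending:
            if anchor in positions:
                ax, ay, az = positions[anchor]
                dx, dy, dz = offset
                positions[subject] = (ax + dx, ay + dy, az + dz)
            else:
                # Anchor not placed yet; retry on the next pass.
                remaining.append((subject, anchor, offset))
        if len(remaining) == len(pending):
            raise ValueError("some triplets reference an anchor that is never placed")
        pending = remaining
    return positions


layout = resolve_layout(
    "door",
    [("lamp", "table", (0.0, 0.0, 0.5)), ("table", "door", (2.0, 1.0, 0.0))],
)
```

Because each triplet only refers to one anchor, triplets can be emitted one at a time and in any order, which is one plausible reading of why such a representation is described as language-model-friendly.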
Problem

Research questions and friction points this paper is trying to address.

spatial intelligence
geometric reasoning
modality alignment
structured scene reasoning
3D grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Structured Scene Reasoning
Cross-modal Alignment
Scene Graph Generation
3D Grounding
Spatial Intelligence
Authors
Yi Zhang (Huawei Co., Ltd): CV, AI, Trustworthy AI
Youya Xia (Foundation Model Department, Huawei)
Yong Wang (Foundation Model Department, Huawei)
Meng Song (PhD Student of Computer Science, University of California, San Diego): Reinforcement Learning, Self-supervised Learning, Robot Learning
Xin Wu (Central Media Technology Institute, Huawei)
Wenjun Wan (Central Media Technology Institute, Huawei)
Bingbing Liu (Researcher, Huawei): Autonomous Driving, Robotics, Neural Rendering, Vision Foundation Model
AiXue Ye (Foundation Model Department, Huawei)
Hongbo Zhang (Foundation Model Department, Huawei)
Feng Wen (Foundation Model Department, Huawei)