Spa3R: Predictive Spatial Field Modeling for 3D Visual Reasoning

📅 2026-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models struggle to develop a complete 3D spatial understanding from 2D images. This work proposes Spa3R, a framework that demonstrates, for the first time, that endogenous spatial intelligence can be learned through self-supervision using only pose-free multi-view 2D images. At its core is Predictive Spatial Field Modeling (PSFM), which synthesizes feature fields for arbitrary novel viewpoints without requiring explicit 3D supervision or geometric reconstruction, thereby yielding a globally consistent and viewpoint-invariant 3D spatial representation. By integrating this capability into existing vision-language models via lightweight adapters, the resulting Spa3-VLM achieves 58.6% accuracy on 3D visual question answering on the VSI-Bench benchmark, substantially outperforming prior methods and validating its effectiveness in enhancing spatial reasoning.

📝 Abstract
While Vision-Language Models (VLMs) exhibit exceptional 2D visual understanding, their ability to comprehend and reason about 3D space--a cornerstone of spatial intelligence--remains superficial. Current methodologies attempt to bridge this domain gap either by relying on explicit 3D modalities or by augmenting VLMs with partial, view-conditioned geometric priors. However, such approaches hinder scalability and ultimately burden the language model with the ill-posed task of implicitly reconstructing holistic 3D geometry from sparse cues. In this paper, we argue that spatial intelligence can emerge inherently from 2D vision alone, rather than being imposed via explicit spatial instruction tuning. To this end, we introduce Spa3R, a self-supervised framework that learns a unified, view-invariant spatial representation directly from unposed multi-view images. Spa3R is built upon the proposed Predictive Spatial Field Modeling (PSFM) paradigm, where Spa3R learns to synthesize feature fields for arbitrary unseen views conditioned on a compact latent representation, thereby internalizing a holistic and coherent understanding of the underlying 3D scene. We further integrate the pre-trained Spa3R Encoder into existing VLMs via a lightweight adapter to form Spa3-VLM, effectively grounding language reasoning in a global spatial context. Experiments on the challenging VSI-Bench demonstrate that Spa3-VLM achieves state-of-the-art accuracy of 58.6% on 3D VQA, significantly outperforming prior methods. These results highlight PSFM as a scalable path toward advancing spatial intelligence. Code is available at https://github.com/hustvl/Spa3R.
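The PSFM idea in the abstract — compress unposed context views into a compact latent, then predict the feature field of a held-out view from that latent alone — can be sketched as a toy training objective. This is a minimal illustration with linear maps and random tensors, not the paper's implementation; the class and function names (`Spa3REncoder`, `FieldDecoder`, `psfm_loss`) and all dimensions are assumptions for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)

class Spa3REncoder:
    """Hypothetical stand-in: pools patch features from unposed context views
    into one compact scene latent (no camera poses are used anywhere)."""
    def __init__(self, feat_dim, latent_dim):
        self.W = rng.standard_normal((feat_dim, latent_dim)) * 0.02
    def __call__(self, views):  # views: (n_views, n_patches, feat_dim)
        pooled = views.mean(axis=(0, 1))  # order-invariant pooling over views/patches
        return pooled @ self.W            # scene latent: (latent_dim,)

class FieldDecoder:
    """Hypothetical stand-in: predicts per-patch features of an unseen view,
    conditioned only on the scene latent and per-patch query embeddings."""
    def __init__(self, latent_dim, query_dim, feat_dim):
        self.W = rng.standard_normal((latent_dim + query_dim, feat_dim)) * 0.02
    def __call__(self, z, queries):  # queries: (n_patches, query_dim)
        z_tiled = np.tile(z, (queries.shape[0], 1))
        return np.concatenate([z_tiled, queries], axis=1) @ self.W

def psfm_loss(pred, target):
    """Self-supervised objective: match the predicted feature field to the
    actual features of the held-out view (no explicit 3D supervision)."""
    return float(np.mean((pred - target) ** 2))

# Toy forward pass: 3 context views, 1 held-out target view of the same scene.
feat_dim, latent_dim, query_dim, n_patches = 64, 32, 16, 49
context = rng.standard_normal((3, n_patches, feat_dim))
target = rng.standard_normal((n_patches, feat_dim))
queries = rng.standard_normal((n_patches, query_dim))

enc = Spa3REncoder(feat_dim, latent_dim)
dec = FieldDecoder(latent_dim, query_dim, feat_dim)
z = enc(context)          # compact latent from pose-free 2D views
pred = dec(z, queries)    # synthesized feature field for the unseen view
loss = psfm_loss(pred, target)
```

Minimizing this loss over many scenes is what, per the abstract, forces the latent `z` to carry a holistic, view-invariant account of the 3D scene; in Spa3-VLM that latent would then be fed to the language model through a lightweight adapter.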
Problem

Research questions and friction points this paper is trying to address.

3D visual reasoning
spatial intelligence
Vision-Language Models
3D understanding
spatial representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Predictive Spatial Field Modeling
Self-supervised 3D Representation
View-invariant Spatial Reasoning
Vision-Language Models
Multi-view Image Understanding