SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images

📅 2026-05-11
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Current large vision-language models exhibit limited performance on spatial reasoning tasks such as depth ordering and coordinate localization, primarily due to the scarcity and low diversity of scene-centric datasets. This work proposes SpatialForge, a scalable data synthesis framework that, for the first time, automatically generates high-quality, structured 3D-aware spatial reasoning supervision signals from massive collections of unconstrained 2D images. These signals encompass depth, layout, and viewpoint relationships, with an integrated automated mechanism to verify data fidelity. Leveraging this approach, the authors construct SpatialForge-10M, a dataset comprising 10 million spatial question-answer pairs, which substantially enhances the spatial reasoning capabilities of standard vision-language models across multiple benchmarks, thereby demonstrating the efficacy of leveraging large-scale 2D data to improve 3D understanding.
📝 Abstract
Recent advancements in Large Vision-Language Models (VLMs) have demonstrated exceptional semantic understanding, yet these models consistently struggle with spatial reasoning, often failing at fundamental geometric tasks such as depth ordering and precise coordinate grounding. Recent efforts introduce spatial supervision from scene-centric datasets (e.g., multi-view scans or indoor video), but are constrained by the limited number of underlying scenes. As a result, the scale and diversity of such data remain significantly smaller than those of web-scale 2D image collections. To address this limitation, we propose SpatialForge, a scalable data synthesis pipeline that transforms in-the-wild 2D images into spatial reasoning supervision. Our approach decomposes spatial reasoning into perception and relation, and constructs structured supervision signals covering depth, layout, and viewpoint-dependent reasoning, with automatic verification to ensure data quality. Based on this pipeline, we build SpatialForge-10M, a large-scale dataset containing 10 million spatial QA pairs. Extensive experiments across multiple spatial reasoning benchmarks demonstrate that training on SpatialForge-10M significantly improves the spatial reasoning ability of standard VLMs, highlighting the effectiveness of scaling 2D data for 3D-aware spatial reasoning.
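The abstract does not give implementation details, so the following is only a minimal sketch of how one depth-ordering QA pair with an automatic verification step might be produced from a single 2D image. It assumes per-object depth values have already been estimated by some monocular depth model. The names `ObjectDepth`, `depth_order_qa`, and `min_rel_gap`, as well as the relative-gap check itself, are hypothetical illustrations and not the authors' pipeline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ObjectDepth:
    """Hypothetical record: an object name plus its estimated depth."""
    name: str
    median_depth: float  # assumed convention: larger value = farther from camera


def depth_order_qa(a: ObjectDepth, b: ObjectDepth,
                   min_rel_gap: float = 0.15) -> Optional[dict]:
    """Build a depth-ordering QA pair, or return None if it cannot be trusted.

    The verification rule here is an assumption for illustration: the pair is
    kept only when the relative depth gap is large enough that a noisy depth
    estimate is unlikely to flip the ordering.
    """
    far = max(a.median_depth, b.median_depth)
    if far <= 0:
        return None  # invalid depth estimates, discard the sample
    rel_gap = abs(a.median_depth - b.median_depth) / far
    if rel_gap < min_rel_gap:
        return None  # ambiguous ordering, discard the sample
    closer = a if a.median_depth < b.median_depth else b
    return {
        "question": f"Which is closer to the camera, the {a.name} or the {b.name}?",
        "answer": closer.name,
    }


if __name__ == "__main__":
    print(depth_order_qa(ObjectDepth("chair", 2.0), ObjectDepth("lamp", 4.0)))
    print(depth_order_qa(ObjectDepth("cup", 1.00), ObjectDepth("plate", 1.05)))
```

In this sketch the first call yields a QA pair, while the second is rejected because the two depths are too close to order reliably. A filter of this kind trades dataset size for label reliability.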
Problem

Research questions and friction points this paper is trying to address.

spatial reasoning
3D-awareness
vision-language models
depth ordering
coordinate grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

SpatialForge
3D-aware spatial reasoning
data synthesis
vision-language models
scalable supervision