🤖 AI Summary
Current vision-language models (VLMs) exhibit limited spatial reasoning capabilities, largely because existing public datasets are small in scale, low in visual diversity, and monotonous in instruction format. To address these limitations, we introduce InternSpatial, a large-scale open-source dataset of 12 million question-answer pairs that covers both single-view and multi-view scenes, introduces a novel rotation angle prediction task for multi-view reasoning, and supports 19 structured instruction templates. Leveraging multi-source visual environment data and a diversified instruction generation mechanism, we further construct InternSpatial-Bench, a comprehensive evaluation benchmark. Experiments demonstrate that models fine-tuned on InternSpatial achieve a 12.1% improvement on InternSpatial-Bench and a 10.7% improvement on VSI-Bench, without compromising performance on general-purpose vision-language tasks. This work represents the first systematic integration of multi-view geometric understanding, instruction diversity, and large-scale open-data curation, significantly advancing the spatial reasoning capabilities of VLMs.
📝 Abstract
Recent benchmarks and datasets have been proposed to improve spatial reasoning in vision-language models (VLMs), yet existing open resources remain limited in scale, visual diversity, and instruction expressiveness. In this work, we introduce InternSpatial, the largest open-source dataset for spatial reasoning in VLMs, along with InternSpatial-Bench, a corresponding evaluation benchmark designed to assess spatial understanding under diverse instruction formats. InternSpatial comprises 12 million QA pairs spanning both single-view and multi-view settings, drawn from diverse visual environments and supporting 19 instruction formats that reflect varied query styles. For evaluation, we propose InternSpatial-Bench for single-view tasks and expand multi-view reasoning by introducing a rotation angle prediction task that has not been explored in prior work. Experimental results show that models trained on InternSpatial achieve a 12.1% improvement on InternSpatial-Bench and a 10.7% improvement on VSI-Bench, while maintaining strong performance on general-purpose benchmarks. We hope these resources will support the development of spatially capable VLMs in practical applications such as robotics and embodied AI.
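To make the dataset description above concrete, the sketch below illustrates what a single-view QA pair and a multi-view rotation-angle QA pair of this kind could look like. The field names, template IDs, and question strings are illustrative assumptions, not the official InternSpatial schema or instruction templates, which are defined by the released data.

```python
# Illustrative sketch only: hypothetical spatial-reasoning QA records in the style
# described above. Field names and templates are NOT the official InternSpatial schema.
from typing import Dict, List

# Hypothetical single-view record: one image, a structured instruction template,
# and a multiple-choice answer about a spatial relation.
single_view_example: Dict = {
    "views": ["scene_0001/frame_00.jpg"],   # one image -> single-view setting
    "task": "spatial_relation",
    "template_id": 3,                        # one of the 19 instruction formats (assumed indexing)
    "question": "Which object is closer to the camera, the chair or the lamp?",
    "options": ["chair", "lamp"],
    "answer": "chair",
}

# Hypothetical multi-view record for the rotation angle prediction task:
# two views of the same scene; the model estimates the camera rotation between them.
multi_view_example: Dict = {
    "views": ["scene_0042/frame_00.jpg", "scene_0042/frame_12.jpg"],
    "task": "rotation_angle_prediction",
    "template_id": 17,
    "question": "By approximately how many degrees did the camera rotate between the two views?",
    "answer": "45",
}


def to_instruction(record: Dict) -> str:
    """Render a record as a plain-text instruction for VLM fine-tuning.

    A real pipeline would vary phrasing per template_id; this sketch simply
    concatenates the question with any answer options.
    """
    parts: List[str] = [record["question"]]
    if "options" in record:
        parts.append("Options: " + ", ".join(record["options"]))
    return " ".join(parts)


if __name__ == "__main__":
    for rec in (single_view_example, multi_view_example):
        print(f"[{rec['task']}] {to_instruction(rec)} -> {rec['answer']}")
```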