🤖 AI Summary
Current vision-language models (VLMs) exhibit limited spatial reasoning capabilities, largely because existing public datasets are small in scale, low in visual diversity, and monotonous in instruction format. To address these limitations, we introduce InternSpatial, a large-scale open-source dataset of 12 million question-answer pairs that covers both single-view and multi-view scenes, introduces a novel rotation angle prediction task for multi-view reasoning, and supports 19 structured instruction templates. Leveraging multi-source visual environment data and a diversified instruction generation mechanism, we further construct InternSpatial-Bench, a comprehensive evaluation benchmark. Experiments demonstrate that models fine-tuned on InternSpatial achieve a 12.1% improvement on InternSpatial-Bench and a 10.7% improvement on VSI-Bench, without compromising performance on general-purpose vision-language tasks. This work represents the first systematic integration of multi-view geometric understanding, instruction diversity, and large-scale open-data curation, significantly advancing the spatial reasoning capabilities of VLMs.
📝 Abstract
Recent benchmarks and datasets have been proposed to improve spatial reasoning in vision-language models (VLMs), yet existing open resources remain limited in scale, visual diversity, and instruction expressiveness. In this work, we introduce InternSpatial, the largest open-source dataset for spatial reasoning in VLMs, along with InternSpatial-Bench, a corresponding evaluation benchmark designed to assess spatial understanding under diverse instruction formats. InternSpatial comprises 12 million QA pairs spanning both single-view and multi-view settings, drawn from diverse visual environments and supporting 19 instruction formats that reflect varied query styles. For evaluation, we propose InternSpatial-Bench for single-view tasks and expand multi-view reasoning by introducing a rotation angle prediction task that has not been explored in prior work. Experimental results show that models trained on InternSpatial achieve a 12.1% improvement on InternSpatial-Bench and a 10.7% improvement on VSI-Bench, while maintaining strong performance on general-purpose benchmarks. We hope these resources will support the development of spatially capable VLMs in practical applications such as robotics and embodied AI.
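To make the dataset description above concrete, the sketch below illustrates what a single-view QA pair and a multi-view rotation-angle QA pair of this kind could look like. The field names, template IDs, and question strings are illustrative assumptions, not the official InternSpatial schema or instruction templates, which are defined by the released data.

```python
# Illustrative sketch only: hypothetical spatial-reasoning QA records in the style
# described above. Field names and templates are NOT the official InternSpatial schema.
from typing import Dict, List

# Hypothetical single-view record: one image, a structured instruction template,
# and a multiple-choice answer about a spatial relation.
single_view_example: Dict = {
    "views": ["scene_0001/frame_00.jpg"],   # one image -> single-view setting
    "task": "spatial_relation",
    "template_id": 3,                        # one of the 19 instruction formats (assumed indexing)
    "question": "Which object is closer to the camera, the chair or the lamp?",
    "options": ["chair", "lamp"],
    "answer": "chair",
}

# Hypothetical multi-view record for the rotation angle prediction task:
# two views of the same scene; the model estimates the camera rotation between them.
multi_view_example: Dict = {
    "views": ["scene_0042/frame_00.jpg", "scene_0042/frame_12.jpg"],
    "task": "rotation_angle_prediction",
    "template_id": 17,
    "question": "By approximately how many degrees did the camera rotate between the two views?",
    "answer": "45",
}


def to_instruction(record: Dict) -> str:
    """Render a record as a plain-text instruction for VLM fine-tuning.

    A real pipeline would vary phrasing per template_id; this sketch simply
    concatenates the question with any answer options.
    """
    parts: List[str] = [record["question"]]
    if "options" in record:
        parts.append("Options: " + ", ".join(record["options"]))
    return " ".join(parts)


if __name__ == "__main__":
    for rec in (single_view_example, multi_view_example):
        print(f"[{rec['task']}] {to_instruction(rec)} -> {rec['answer']}")
```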