Visual Spatial Tuning

📅 2025-11-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limited spatial perception and reasoning capabilities of vision-language models (VLMs) and their reliance on auxiliary expert encoders. To overcome these limitations, we propose Visual Spatial Tuning (VST), an end-to-end enhancement method that requires no dedicated spatial modules or architectural modifications. Our approach features: (1) a progressive multi-stage training pipeline that combines large-scale supervised fine-tuning with reinforcement learning to jointly develop general multimodal competence and spatial specialization; and (2) two curated datasets, VST-P (4.1M samples covering 19 spatial-perception skills) and VST-R (135K samples for spatial reasoning), spanning single-image, multi-image, and video scenarios. Evaluated on MMSI-Bench and VSIBench, our method achieves 34.8% and 61.2% accuracy, respectively, reaching state-of-the-art results without degrading general capabilities. The work shows that human-like visuospatial ability can be cultivated in VLMs through purely data-driven, architecture-neutral tuning.

📝 Abstract
Capturing spatial relationships from visual inputs is a cornerstone of human-like general intelligence. Several previous studies have tried to enhance the spatial awareness of Vision-Language Models (VLMs) by adding extra expert encoders, which brings extra overhead and usually harms general capabilities. To enhance the spatial ability in general architectures, we introduce Visual Spatial Tuning (VST), a comprehensive framework to cultivate VLMs with human-like visuospatial abilities, from spatial perception to reasoning. We first attempt to enhance spatial perception in VLMs by constructing a large-scale dataset termed VST-P, which comprises 4.1 million samples spanning 19 skills across single views, multiple images, and videos. Then, we present VST-R, a curated dataset with 135K samples that instruct models to reason in space. In particular, we adopt a progressive training pipeline: supervised fine-tuning to build foundational spatial knowledge, followed by reinforcement learning to further improve spatial reasoning abilities. Without side effects on general capabilities, the proposed VST consistently achieves state-of-the-art results on several spatial benchmarks, including 34.8% on MMSI-Bench and 61.2% on VSIBench. It turns out that Vision-Language-Action models can be significantly enhanced with the proposed spatial tuning paradigm, paving the way for more physically grounded AI.
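The abstract describes a two-stage recipe: supervised fine-tuning on VST-P to build spatial perception, followed by reinforcement learning on VST-R to strengthen reasoning. The sketch below illustrates that flow in outline only; the `model` interface, the `Sample` fields, and the update calls are hypothetical placeholders, not the paper's released code.

```python
# Minimal sketch of the progressive training pipeline described above.
# Assumptions: the model exposes nll/step/generate/policy_step methods and the
# datasets yield prompt/target/media fields -- all hypothetical placeholders.

from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Sample:
    prompt: str        # spatial question over one image, several images, or a video
    target: str        # reference answer used in the supervised stage
    media: list[str]   # paths or IDs of the visual inputs


def supervised_fine_tune(model, vst_p: Iterable[Sample], steps: int) -> None:
    """Stage 1: build foundational spatial perception from VST-P-style data."""
    for _, sample in zip(range(steps), vst_p):
        loss = model.nll(sample.prompt, sample.media, sample.target)  # cross-entropy on the target
        model.step(loss)


def reinforce_spatial_reasoning(model, vst_r: Iterable[Sample],
                                reward_fn: Callable[[str, str], float], steps: int) -> None:
    """Stage 2: improve spatial reasoning with a policy-gradient-style update on VST-R-style data."""
    for _, sample in zip(range(steps), vst_r):
        answer = model.generate(sample.prompt, sample.media)
        reward = reward_fn(answer, sample.target)   # verifiable reward, e.g. rule-based check
        model.policy_step(answer, reward)           # RL update (a GRPO/PPO-like step in practice)
```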
Problem

Research questions and friction points this paper is trying to address.

Enhancing spatial perception and reasoning in Vision-Language Models without adding expert encoders
Building comprehensive datasets that cover spatial skills across single images, multiple images, and videos
Raising performance on spatial benchmarks through a progressive training pipeline
Innovation

Methods, ideas, or system contributions that make the work stand out.

VST framework enhances spatial perception and reasoning within a general VLM architecture
Progressive training: supervised fine-tuning followed by reinforcement learning (see the reward sketch below)
Trains on the large-scale VST-P and VST-R datasets
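As referenced in the second point above, the reinforcement-learning stage needs a verifiable reward on spatial answers. Below is a hedged, rule-based example (exact match for categorical relations, relative tolerance for numeric answers such as distances or counts); the reward actually used in VST is not specified here, so treat this as an illustrative assumption.

```python
# Hedged sketch of a rule-based reward for spatial answers. The numeric-tolerance
# rule and the answer formats are assumptions, not the paper's specification.

import re


def spatial_reward(prediction: str, reference: str, rel_tol: float = 0.1) -> float:
    """Return 1.0 for a correct answer, else 0.0.

    Numeric answers pass within a relative tolerance; categorical answers
    (e.g. 'left', 'behind') require an exact, case-insensitive match.
    """
    pred_num = re.findall(r"-?\d+\.?\d*", prediction)
    ref_num = re.findall(r"-?\d+\.?\d*", reference)
    if ref_num:  # numeric ground truth
        if not pred_num:
            return 0.0
        p, r = float(pred_num[0]), float(ref_num[0])
        return 1.0 if abs(p - r) <= rel_tol * max(abs(r), 1e-6) else 0.0
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0
```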