SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models

📅 2025-10-09
🤖 AI Summary
Current vision-language models lack a hierarchical foundation for spatial reasoning—from perception to understanding—resulting in poor robustness. To address this, the authors propose SpatialLadder: a three-stage progressive training paradigm; SpatialLadder-26k, a multimodal dataset covering object localization and single-image, multi-view, and video-based spatial reasoning; and an integrated learning strategy combining supervised training on perception and multi-dimensional spatial tasks with verifiable-reward-driven reinforcement learning. The resulting 3B-parameter model, SpatialLadder, achieves a 23.4% average improvement over its base model on spatial reasoning benchmarks, surpassing GPT-4o by 20.8% and Gemini-2.0-Flash by 10.1%. It also maintains a 7.2% improvement on out-of-domain benchmarks, demonstrating that progressively bridging low-level perception and high-level spatial reasoning through unified dataset and training design yields robust spatial intelligence.

📝 Abstract
Spatial reasoning remains a fundamental challenge for Vision-Language Models (VLMs), with current approaches struggling to achieve robust performance despite recent advances. We identify that this limitation stems from a critical gap: existing methods attempt to learn spatial reasoning directly without establishing the hierarchical foundations of perception and understanding. To address this challenge, we present a comprehensive methodology for building spatial intelligence progressively. We introduce SpatialLadder-26k, a multimodal dataset containing 26,610 samples spanning object localization, single image, multi-view, and video spatial reasoning tasks, constructed through a standardized pipeline that ensures systematic coverage across modalities. Building on this dataset, we design a three-stage progressive training framework that (1) establishes spatial perception through object localization, (2) develops spatial understanding through multi-dimensional spatial tasks, and (3) strengthens complex reasoning via reinforcement learning with verifiable rewards. This approach yields SpatialLadder, a 3B-parameter model that achieves state-of-the-art performance on spatial reasoning benchmarks, with 23.4% average improvement over the base model, surpassing GPT-4o by 20.8% and Gemini-2.0-Flash by 10.1%. Notably, SpatialLadder maintains strong generalization with 7.2% improvement on out-of-domain benchmarks, demonstrating that progressive training from perception to reasoning is essential for robust spatial intelligence.
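The abstract's three-stage curriculum maps each sample type in SpatialLadder-26k to a training stage. A minimal sketch of that routing, assuming hypothetical task labels and field names (the paper's actual data schema is not specified here):

```python
# Hypothetical routing of SpatialLadder-26k samples into the three training
# stages named in the abstract. The "task" labels and "verifiable_answer"
# field are illustrative assumptions, not the authors' schema.

def split_stages(samples):
    """Assign each sample to perception, understanding, and/or reasoning stages."""
    stages = {"perception": [], "understanding": [], "reasoning": []}
    for s in samples:
        if s["task"] == "localization":            # Stage 1: spatial perception
            stages["perception"].append(s)
        elif s["task"] in ("single_image", "multi_view", "video"):
            stages["understanding"].append(s)      # Stage 2: spatial understanding
        if s.get("verifiable_answer"):             # Stage 3: RL needs checkable answers
            stages["reasoning"].append(s)
    return stages
```

Training would then run supervised fine-tuning over the first two buckets in order, followed by reinforcement learning over the third.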
Problem

Research questions and friction points this paper is trying to address.

Addressing spatial reasoning limitations in vision-language models through progressive training
Developing hierarchical foundations from perception to complex spatial understanding
Creating robust spatial intelligence with improved generalization across diverse tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Three-stage progressive training framework: spatial perception via object localization, spatial understanding via multi-dimensional spatial tasks, and complex reasoning via reinforcement learning
SpatialLadder-26k, a 26,610-sample multimodal dataset spanning object localization and single-image, multi-view, and video spatial reasoning
Reinforcement learning with verifiable rewards to strengthen complex spatial reasoning
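The third stage relies on rewards that can be checked programmatically rather than scored by a learned model. A generic sketch in the style of recent verifiable-reward RL work; the tag format, matching rule, and weights below are assumptions, not the paper's exact reward design:

```python
import re

# Illustrative verifiable reward: a small format bonus for emitting
# <think>...</think><answer>...</answer>, plus exact-match accuracy on the
# extracted answer. Tags and weights are assumptions for the sketch.

def verifiable_reward(response: str, gold_answer: str) -> float:
    """Return a scalar reward combining format compliance and answer accuracy."""
    fmt_ok = bool(re.search(r"<think>.*</think>\s*<answer>.*</answer>",
                            response, re.DOTALL))
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    pred = m.group(1).strip().lower() if m else ""
    acc = 1.0 if pred == gold_answer.strip().lower() else 0.0
    return 0.2 * fmt_ok + 0.8 * acc  # weights are illustrative
```

Because the reward is a deterministic function of the model's text output, it avoids the reward hacking risks of learned reward models and makes the reinforcement stage reproducible.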
Authors

Hongxing Li, Zhejiang University
Dingming Li, Zhejiang University
Zixuan Wang, Zhejiang University
Yuchen Yan, Zhejiang University
Hang Wu, Zhejiang University
Wenqi Zhang, Zhejiang University (Language Model, Multimodal Learning, Embodied Agents)
Yongliang Shen, Zhejiang University
Weiming Lu, Zhejiang University (Natural Language Processing, Large Language Models, AGI)
Jun Xiao, Zhejiang University
Yueting Zhuang, Zhejiang University