SpaceVista: All-Scale Visual Spatial Reasoning from mm to km

📅 2025-10-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing spatial reasoning models rely heavily on indoor 3D scans and manual annotations, and generalize poorly across scales, from millimeters to kilometers.

Method: We propose a structured knowledge system for all-scale spatial intelligence, introducing the first scale-aware modeling mechanism and a progressive training paradigm for unified cross-scale understanding. Leveraging an expert-driven automated pipeline, we construct SpaceVista-1M, a million-scale video question-answering dataset. Our model is trained via dense input modeling, scale-anchored expert networks, and a progressive reward mechanism.

Contribution/Results: Evaluated on five benchmarks, including our newly established SpaceVista-Bench, the model significantly outperforms state-of-the-art methods across 19 diverse spatial reasoning tasks. It demonstrates strong generalization and robust multi-scale performance, validating its effectiveness for real-world, heterogeneous spatial intelligence applications.

📝 Abstract
With the current surge in spatial reasoning explorations, researchers have made significant progress in understanding indoor scenes, but still struggle with diverse applications such as robotics and autonomous driving. This paper aims to advance all-scale spatial reasoning across diverse scenarios by tackling two key challenges: 1) the heavy reliance on indoor 3D scans and labor-intensive manual annotations for dataset curation; 2) the absence of effective all-scale scene modeling, which often leads to overfitting to individual scenes. In this paper, we introduce a holistic solution that integrates a structured spatial reasoning knowledge system, scale-aware modeling, and a progressive training paradigm, which is, to the best of our knowledge, the first attempt to broaden the all-scale spatial intelligence of MLLMs. Using a task-specific, specialist-driven automated pipeline, we curate over 38K video scenes across 5 spatial scales to create SpaceVista-1M, a dataset comprising approximately 1M spatial QA pairs spanning 19 diverse task types. While specialist models can inject useful domain knowledge, they are not reliable for evaluation. We therefore build an all-scale benchmark with precise annotations by manually recording, retrieving, and assembling video-based data. However, naive training with SpaceVista-1M often yields suboptimal results due to potential knowledge conflicts. Accordingly, we introduce SpaceVista-7B, a spatial reasoning model that accepts dense inputs beyond semantics and uses scale as an anchor for scale-aware experts and progressive rewards. Finally, extensive evaluations across 5 benchmarks, including our SpaceVista-Bench, demonstrate competitive performance, showcasing strong generalization across all scales and scenarios. Our dataset, model, and benchmark will be released at https://peiwensun2000.github.io/mm2km.
Problem

Research questions and friction points this paper is trying to address.

Advancing all-scale spatial reasoning across diverse scenarios from millimeters to kilometers
Reducing reliance on indoor 3D scans and manual annotations for dataset curation
Addressing the absence of effective all-scale scene modeling, which causes overfitting to individual scenes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Structured knowledge system with scale-aware modeling
Progressive training paradigm using scale as anchor
Automated pipeline curating multi-scale video dataset
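The scale-anchored design above can be illustrated with a minimal sketch. The paper does not publish this code; here we assume the "scale-aware experts" resemble a mixture-of-experts whose gate is keyed on a predicted scene-scale distribution, blending per-scale expert outputs. All names (`SCALES`, `soft_gate`, `route`) are hypothetical illustrations, not the authors' implementation.

```python
import math

# Hypothetical: the 5 spatial scales covered by SpaceVista-1M, mm to km.
SCALES = ["mm", "cm", "m", "10m", "km"]

def soft_gate(scale_logits):
    """Numerically stable softmax over predicted scale logits,
    producing one mixture weight per scale-anchored expert."""
    m = max(scale_logits)
    exps = [math.exp(x - m) for x in scale_logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(features, scale_logits, experts):
    """Blend the outputs of per-scale experts using the gate weights.
    `experts` is one callable per entry in SCALES."""
    weights = soft_gate(scale_logits)
    outputs = [expert(features) for expert in experts]
    return sum(w * o for w, o in zip(weights, outputs))
```

With a confident scale prediction the gate collapses onto one expert, so the model behaves like a per-scale specialist; with an ambiguous prediction it interpolates between neighboring scales.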
Peiwen Sun
Multimedia lab, The Chinese University of Hong Kong
multimodal learning
Shiqiang Lang
Beijing University of Posts and Telecommunications
Dongming Wu
MMLab, CUHK; CPII
Computer Vision, Vision and Language, MLLM, Embodied AI
Yi Ding
Astribot
Kaituo Feng
MMLab, CUHK
Multimodal LLMs, Machine Learning
Huadai Liu
Hong Kong University of Science and Technology
Zhen Ye
Hong Kong University of Science and Technology
Rui Liu
Multimedia Lab, Chinese University of Hong Kong
Yun-Hui Liu
Multimedia Lab, Chinese University of Hong Kong
Jianan Wang
Astribot / IDEA / Deepmind / Oxford
Computer Vision, Generative AI, Robotics, Learning Theory
Xiangyu Yue
The Chinese University of Hong Kong / UC Berkeley / Stanford University / NJU
Artificial Intelligence, Computer Vision, Multi-modal Learning