SD-VLM: Spatial Measuring and Understanding with Depth-Encoded Vision-Language Models

📅 2025-09-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models (VLMs) excel at 2D semantic understanding but show significant limitations in quantitative 3D spatial reasoning, primarily because 2D images lack explicit depth and metric information. To address this, the authors propose SD-VLM, built on three contributions: (1) the large-scale Massive Spatial Measuring and Understanding (MSMU) dataset, comprising 700K question-answer pairs and 2.5M physical numerical annotations; (2) a lightweight depth positional encoding mechanism that fuses dense depth maps and geometric priors into visual features; and (3) 10K chain-of-thought-augmented training samples that enable interpretable, spatially grounded quantitative reasoning. On the newly constructed benchmark MSMU-Bench, SD-VLM outperforms GPT-4o and InternVL3-78B by 26.91% and 25.56%, respectively, and demonstrates strong cross-domain generalization across diverse spatial understanding tasks.

📝 Abstract
While vision-language models (VLMs) excel at 2D semantic visual understanding, their ability to reason quantitatively about 3D spatial relationships remains under-explored, owing to the limited spatial representation capacity of 2D images. In this paper, we analyze the problems hindering VLMs' spatial understanding and propose SD-VLM, a novel framework that significantly enhances the fundamental spatial perception abilities of VLMs through two key contributions: (1) the Massive Spatial Measuring and Understanding (MSMU) dataset with precise spatial annotations, and (2) a simple depth positional encoding method that strengthens VLMs' spatial awareness. The MSMU dataset covers massive quantitative spatial tasks with 700K QA pairs, 2.5M physical numerical annotations, and 10K chain-of-thought-augmented samples. We train SD-VLM, a strong generalist VLM that shows superior quantitative spatial measuring and understanding capability. SD-VLM not only achieves state-of-the-art performance on our proposed MSMU-Bench, but also shows spatial generalization on other spatial understanding benchmarks, including Q-Spatial and SpatialRGPT-Bench. Extensive experiments demonstrate that SD-VLM outperforms GPT-4o and InternVL3-78B by 26.91% and 25.56%, respectively, on MSMU-Bench. Code and models are released at https://github.com/cpystan/SD-VLM.
Problem

Research questions and friction points this paper is trying to address.

Enhancing VLMs' ability to reason about 3D spatial relationships quantitatively
Addressing the deficiency of 2D images' spatial representation capabilities
Improving spatial perception through depth encoding and spatial annotations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Massive Spatial Measuring and Understanding (MSMU) dataset with 700K QA pairs and 2.5M precise physical annotations
Depth positional encoding that injects metric depth into visual features for spatial awareness
State-of-the-art quantitative spatial measuring and understanding, with generalization to other spatial benchmarks
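The page describes the depth positional encoding only at a high level (fusing depth information into visual features). A minimal sketch of one plausible variant is shown below; the sinusoidal formulation and the names `sinusoidal_depth_encoding` and `fuse_depth` are illustrative assumptions, not the paper's actual implementation, which reuses the Transformer position-encoding recipe with each patch's metric depth playing the role of the position index.

```python
import numpy as np

def sinusoidal_depth_encoding(depths, d_model):
    """Map each per-patch metric depth (e.g. meters) to a d_model-dim code.

    Hypothetical sketch: applies the standard sinusoidal position-encoding
    formula with depth as the "position", so patches at equal depth receive
    identical codes regardless of their 2D location.
    """
    i = np.arange(d_model // 2)
    freqs = 1.0 / (10000.0 ** (2 * i / d_model))       # (d_model/2,)
    angles = depths[:, None] * freqs[None, :]          # (N, d_model/2)
    enc = np.empty((depths.shape[0], d_model))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

def fuse_depth(patch_features, depths):
    """Add the depth-derived code element-wise to visual patch features."""
    n, d = patch_features.shape
    return patch_features + sinusoidal_depth_encoding(depths, d)
```

In a real pipeline, `depths` would come from a monocular depth estimator aligned to the vision encoder's patch grid, and the fused features would then be passed to the language model as usual.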
Pingyi Chen
Zhejiang University, Westlake University, Alibaba Cloud Computing
Yujing Lou
Alibaba Cloud Computing, Shanghai Jiao Tong University
Shen Cao
Alibaba Cloud Computing
Jinhui Guo
Alibaba Cloud Computing
Lubin Fan
Alibaba Cloud
Yue Wu
Alibaba Cloud Computing
Lin Yang
Westlake University
Lizhuang Ma
Shanghai Jiao Tong University
Jieping Ye
Alibaba Cloud Computing

Topics: Computer Graphics, Computer Vision, MLLM