From Flatland to Space: Teaching Vision-Language Models to Perceive and Reason in 3D

📅 2025-03-29
🤖 AI Summary
Current large vision-language models (LVLMs) exhibit significant limitations in 3D spatial perception and reasoning. Method: We propose a data-driven paradigm that bypasses explicit 3D modeling: synthesizing spatially rich 2D images from 3D scene ground truth to construct SPAR-Bench—the first benchmark dedicated to the systematic evaluation of spatial capabilities, supporting both single- and multi-view settings—and establishing a scalable 3D-to-2D spatial task generation and annotation pipeline. Leveraging this pipeline, we release SPAR-7M, a large-scale synthetic dataset, and introduce a multi-stage collaborative training strategy that integrates 2D pretraining with targeted 3D-task fine-tuning. Contribution/Results: Our approach achieves state-of-the-art performance on 2D spatial understanding benchmarks and attains competitive 3D reasoning performance with only minimal 3D-task fine-tuning data, demonstrating that high-fidelity 2D spatial representations are an effective and feasible route to enhancing LVLMs' 3D capabilities.

📝 Abstract
Recent advances in LVLMs have improved vision-language understanding, but they still struggle with spatial perception, limiting their ability to reason about complex 3D scenes. Unlike previous approaches that incorporate 3D representations into models to improve spatial understanding, we aim to unlock the potential of VLMs by leveraging spatially relevant image data. To this end, we introduce a novel 2D spatial data generation and annotation pipeline built upon scene data with 3D ground-truth. This pipeline enables the creation of a diverse set of spatial tasks, ranging from basic perception tasks to more complex reasoning tasks. Leveraging this pipeline, we construct SPAR-7M, a large-scale dataset generated from thousands of scenes across multiple public datasets. In addition, we introduce SPAR-Bench, a benchmark designed to offer a more comprehensive evaluation of spatial capabilities compared to existing spatial benchmarks, supporting both single-view and multi-view inputs. Training on both SPAR-7M and large-scale 2D datasets enables our models to achieve state-of-the-art performance on 2D spatial benchmarks. Further fine-tuning on 3D task-specific datasets yields competitive results, underscoring the effectiveness of our dataset in enhancing spatial reasoning.
Problem

Research questions and friction points this paper is trying to address.

Enhancing VLMs' 3D spatial perception and reasoning
Generating diverse 2D spatial tasks with 3D ground-truth
Improving model performance on 2D and 3D spatial benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

2D spatial data generation pipeline
SPAR-7M dataset from 3D scenes
SPAR-Bench for spatial evaluation
Jiahui Zhang
School of Data Science, Fudan University
Yurui Chen
School of Data Science, Fudan University
Yanpeng Zhou
Huawei Noah's Ark Lab
Yueming Xu
School of Data Science, Fudan University
Ze Huang
School of Data Science, Fudan University
Jilin Mei
Research Center for Intelligent Computing Systems, Institute of Computing Technology, University of Chinese Academy of Sciences
Junhui Chen
School of Data Science, Fudan University
Yu-Jie Yuan
Institute of Computing Technology, Chinese Academy of Sciences
Xinyue Cai
Huawei Noah's Ark Lab
Guowei Huang
Huawei Noah's Ark Lab
Xingyue Quan
Huawei Noah's Ark Lab
Hang Xu
Huawei Noah's Ark Lab
Li Zhang
School of Data Science, Fudan University