N3D-VLM: Native 3D Grounding Enables Accurate Spatial Reasoning in Vision-Language Models

📅 2025-12-18
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Current vision-language models (VLMs) lack native 3D perception, which hinders accurate understanding of spatial relationships and depth structure. To address this, we propose the first VLM that explicitly supports 3D spatial reasoning. Our method introduces: (1) a native 3D object-aware architecture; (2) an explicit 3D chain-of-thought (CoT) reasoning paradigm for interpretable spatial relationship modeling; and (3) the first large-scale, diverse pipeline for generating 3D grounding and spatial question-answering data, combining depth-guided 2D→3D annotation lifting, point cloud–text alignment, and CoT-driven joint training. Experiments demonstrate state-of-the-art performance on 3D grounding and consistent gains over existing VLMs on 3D spatial reasoning benchmarks. The generated dataset is more than six times larger than the largest existing single-image 3D detection dataset, enabling scalable 3D vision-language learning.
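
The depth-guided 2D→3D annotation lifting at the core of the data pipeline can be pictured as back-projecting depth inside a 2D box through a pinhole camera model. The sketch below is an assumed simplification, not the paper's exact procedure: the function name, the intrinsics inputs (fx, fy, cx, cy), and the crude axis-aligned box fit are illustrative only.

```python
# Minimal sketch (assumption, not the paper's pipeline): lift a 2D box
# annotation into a 3D axis-aligned box using a metric depth map and
# pinhole camera intrinsics.
import numpy as np

def lift_2d_box_to_3d(depth, box2d, fx, fy, cx, cy):
    """depth: HxW metric depth map; box2d: integer pixel coords (u0, v0, u1, v1)."""
    u0, v0, u1, v1 = box2d
    us, vs = np.meshgrid(np.arange(u0, u1), np.arange(v0, v1))
    z = depth[vs, us]
    valid = z > 0                      # drop missing depth; assumes some valid pixels remain
    z = z[valid]
    # Back-project each valid pixel to camera-space 3D coordinates.
    x = (us[valid] - cx) * z / fx
    y = (vs[valid] - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1)
    # Crude axis-aligned 3D box from the point extents.
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    return {"center": ((lo + hi) / 2).tolist(), "size": (hi - lo).tolist()}
```

A real annotation pipeline would likely add outlier filtering and oriented-box fitting on top of such a projection; the sketch only shows the geometric core of lifting 2D labels into 3D with estimated depth.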

📝 Abstract
While current multimodal models can answer questions based on 2D images, they lack intrinsic 3D object perception, limiting their ability to comprehend spatial relationships and depth cues in 3D scenes. In this work, we propose N3D-VLM, a novel unified framework that seamlessly integrates native 3D object perception with 3D-aware visual reasoning, enabling both precise 3D grounding and interpretable spatial understanding. Unlike conventional end-to-end models that directly predict answers from RGB/RGB-D inputs, our approach equips the model with native 3D object perception, enabling it to directly localize objects in 3D space based on textual descriptions. Building upon accurate 3D object localization, the model further performs explicit reasoning in 3D, achieving more interpretable and structured spatial understanding. To support robust training for these capabilities, we develop a scalable data construction pipeline that leverages depth estimation to lift large-scale 2D annotations into 3D space, significantly increasing the diversity and coverage of 3D object grounding data and yielding a dataset over six times larger than the largest existing single-image 3D detection dataset. Moreover, the pipeline generates spatial question-answering datasets that target chain-of-thought (CoT) reasoning in 3D, facilitating joint training for both 3D object localization and 3D spatial reasoning. Experimental results demonstrate that our unified framework not only achieves state-of-the-art performance on 3D grounding tasks, but also consistently surpasses existing vision-language models in 3D spatial reasoning.
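
The explicit 3D reasoning described above can be pictured as verbalizing geometric quantities computed from grounded 3D object centers. The sketch below is an assumption about what one such chain-of-thought step could look like; the function name, camera-frame coordinate convention (x right, y down, z forward), and wording are not taken from the paper.

```python
# Illustrative sketch (assumption): turn grounded 3D object centers into an
# explicit geometric reasoning trace of pairwise distances and directions.
import numpy as np

def spatial_cot(objects):
    """objects: dict name -> 3D center [x, y, z] in meters, camera frame."""
    lines = []
    names = list(objects)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            pa, pb = np.asarray(objects[a]), np.asarray(objects[b])
            dist = np.linalg.norm(pa - pb)
            side = "left of" if pb[0] < pa[0] else "right of"
            depth = "closer to the camera than" if pb[2] < pa[2] else "farther from the camera than"
            lines.append(f"{b} is {dist:.2f} m from {a}, {side} it, and {depth} it.")
    return "\n".join(lines)

# Toy usage with two hypothetical grounded objects.
print(spatial_cot({"chair": [0.4, 0.1, 2.0], "table": [-0.3, 0.0, 2.6]}))
```

The point of anchoring the answer in computed distances and directions, rather than predicting it end-to-end from pixels, is that each spatial claim in the chain of thought can be traced back to explicit 3D geometry.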
Problem

Research questions and friction points this paper is trying to address.

Enables 3D object perception for spatial reasoning in vision-language models
Integrates 3D grounding with interpretable spatial understanding in a unified framework
Addresses limited 3D data by constructing scalable 3D annotations from 2D sources
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates native 3D object perception with 3D-aware visual reasoning
Localizes objects in 3D space based on textual descriptions (a hypothetical input/output sketch follows this list)
Uses depth estimation to lift 2D annotations into 3D space
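
For the text-conditioned 3D localization noted above, a request/response shape might look like the following. This is purely illustrative: the field names, coordinate values, and the presence of a textual reasoning field are assumptions, since the paper's exact output format is not given here.

```python
# Hypothetical I/O shape for text-conditioned 3D grounding (not from the paper).
import json

query = "the mug closest to the laptop"
response = {
    "object": "mug",
    "box_3d": {                       # camera-frame box, meters
        "center": [0.12, -0.05, 0.85],
        "size": [0.09, 0.11, 0.09],
        "yaw": 0.0,
    },
    "reasoning": "Two mugs detected; the one at z=0.85 m is nearer to the laptop, so it is returned.",
}
print(json.dumps(response, indent=2))
```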