DeepSight: Bridging Depth Maps and Language with a Depth-Driven Multimodal Model

📅 2026-03-06
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing multimodal large language models struggle to effectively interpret depth information in visual data, limiting their capacity for 3D scene understanding. To address this, this work proposes DeepSight, the first depth-aware multimodal large language model, which significantly enhances spatial reasoning by fusing single-channel depth maps with language instructions. We introduce the first depth image–text paired dataset with instruction annotations, generated using GLPN for depth estimation, GPT-4 for instruction synthesis, and LLaVA for quality validation. Furthermore, we adapt CLIP's Vision Transformer encoder to better capture local continuity in depth variations. Evaluated on a newly curated depth-based visual question answering benchmark, DeepSight substantially outperforms existing models, demonstrating markedly improved depth perception and enhanced performance on downstream tasks, thereby advancing multimodal 3D scene understanding.
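The summary above describes a dataset pipeline that turns RGB images into single-channel depth maps before pairing them with instructions. As an illustrative sketch only (not the authors' code, and the function name is hypothetical), the step of converting a raw metric depth prediction, such as GLPN's output, into a normalized single-channel grayscale image might look like:

```python
import numpy as np

def depth_to_grayscale(depth: np.ndarray) -> np.ndarray:
    """Normalize a raw depth prediction (H, W) in metres to an
    8-bit single-channel grayscale image, the form depth MLLM
    pipelines typically feed to a vision encoder."""
    d_min, d_max = depth.min(), depth.max()
    if d_max - d_min < 1e-8:          # flat scene: avoid divide-by-zero
        return np.zeros_like(depth, dtype=np.uint8)
    norm = (depth - d_min) / (d_max - d_min)   # scale to [0, 1]
    return (norm * 255).round().astype(np.uint8)

# Example: a toy 2x2 "depth map" ranging from 1 m to 5 m
toy = np.array([[1.0, 2.0], [3.0, 5.0]])
gray = depth_to_grayscale(toy)
print(gray)  # nearest pixel -> 0, farthest -> 255
```

Min-max normalization is one common convention; pipelines that need metric depth preserved would instead keep the raw values or use a fixed scale.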

๐Ÿ“ Abstract
Multimodal large language models (MLLMs) have achieved impressive performance across tasks such as image captioning and visual question answering (VQA); however, they often struggle to accurately interpret the depth information inherent in visual data. In this work, we introduce DeepSight, the first dedicated depth MLLM designed to enhance three-dimensional scene understanding. Unlike conventional methods that align RGB image encodings with text, our approach exploits the unique characteristics of depth images: single-channel grayscale images whose pixel values directly encode depth cues, which we leverage to improve spatial reasoning. To address the scarcity of depth data and the inadequacy of simple channel replication, we construct a novel depth image–text pair dataset and a depth instruction dataset. Depth maps are generated from visual images using the GLPN model, and GPT-4 is employed to curate the corresponding depth instructions, an approach validated with LLaVA. Additionally, we modify the ViT encoder in CLIP to incorporate local object information, thereby capturing the subtle, continuous variations of depth more effectively. To evaluate our model, we develop a comprehensive depth question answering benchmark based on existing depth image datasets, which rigorously assesses understanding in typical depth map scenarios. Experimental results demonstrate that DeepSight significantly enhances depth perception and downstream task performance, marking a substantial step forward in multimodal three-dimensional understanding.
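The abstract notes that simply replicating the depth channel three times is inadequate for an RGB-pretrained encoder. The paper's actual ViT modification targets local object information and is not detailed here; as a hedged sketch of one common alternative to channel replication, a 3-channel patch-embedding kernel can be adapted to accept single-channel depth input by averaging its input-channel weights (all shapes and names below are illustrative assumptions):

```python
import numpy as np

# Hypothetical CLIP-like patch-embedding weights:
# shape (embed_dim, in_channels=3, patch, patch)
rng = np.random.default_rng(0)
embed_dim, patch = 8, 4
w_rgb = rng.normal(size=(embed_dim, 3, patch, patch))

# Collapse the RGB input channels into one by averaging, so a
# projection pretrained on 3-channel input accepts (1, H, W) depth
# maps. Averaging keeps the output scale of a replicated-channel
# input unchanged, since mean(w) . (d,d,d) == w . d per channel.
w_depth = w_rgb.mean(axis=1, keepdims=True)  # (embed_dim, 1, patch, patch)

# Project a single-channel depth patch with the adapted weights.
depth_patch = rng.normal(size=(1, patch, patch))
embedding = np.tensordot(w_depth, depth_patch,
                         axes=([1, 2, 3], [0, 1, 2]))
print(embedding.shape)  # (embed_dim,) i.e. (8,)
```

This weight-averaging trick is a standard way to retarget pretrained vision backbones to non-RGB modalities; it is shown here only to make the "channel replication is inadequate" point concrete.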
Problem

Research questions and friction points this paper is trying to address.

depth perception
multimodal large language models
3D scene understanding
depth maps
spatial reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

depth-driven multimodal model
depth map understanding
multimodal large language model
3D scene understanding
depth instruction tuning