AI Summary
Existing multimodal large language models struggle to interpret the depth information in visual data, limiting their capacity for 3D scene understanding. To address this, this work proposes DeepSight, the first depth-aware multimodal large language model, which significantly enhances spatial reasoning by fusing single-channel depth maps with language instructions. The authors introduce the first depth image–text paired dataset with instruction annotations, generated using GLPN for depth estimation, GPT-4 for instruction synthesis, and LLaVA for quality validation. Furthermore, they adapt CLIP's Vision Transformer encoder to better capture the local continuity of depth variations. Evaluated on a newly curated depth-based visual question answering benchmark, DeepSight substantially outperforms existing models, demonstrating markedly improved depth perception and stronger performance on downstream tasks, thereby advancing multimodal 3D scene understanding.
Abstract
Multimodal large language models (MLLMs) have achieved impressive performance across tasks such as image captioning and visual question answering (VQA); however, they often struggle to accurately interpret the depth information inherent in visual data. In this work, we introduce DeepSight, the first dedicated depth MLLM designed to enhance three-dimensional scene understanding. Unlike conventional methods that align RGB image encodings with text, our approach exploits the distinctive characteristics of depth images, single-channel grayscale images whose pixel values directly encode depth cues, to improve spatial reasoning. To address the scarcity of depth data and the inadequacy of simple channel replication, we construct a novel depth image-text pair dataset and a depth instruction dataset. Depth maps are generated from visual images using the GLPN model, and GPT-4 is employed to curate corresponding depth instructions, an approach validated by LLaVA. Additionally, we modify the ViT encoder in CLIP to incorporate local object information, thereby capturing the subtle continuous variations of depth more effectively. To evaluate our model, we develop a comprehensive depth question answering benchmark based on existing depth image datasets, which rigorously assesses understanding in typical depth map scenarios. Experimental results demonstrate that DeepSight significantly enhances depth perception and downstream task performance, marking a substantial step forward in multimodal three-dimensional understanding.
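To make the data format concrete, the following minimal sketch (not the paper's code; the toy values are invented) shows what a single-channel depth map looks like as an array, and what the "simple channel replication" the abstract calls inadequate amounts to: tiling the one grayscale channel three times so a stock RGB encoder such as CLIP's ViT can consume it.

```python
import numpy as np

# Hypothetical 2x2 single-channel depth map (values in meters).
depth = np.array([[1.5, 2.0],
                  [3.2, 0.8]], dtype=np.float32)

# Normalize to [0, 255] grayscale, the typical storage format for depth maps.
gray = ((depth - depth.min()) / (depth.max() - depth.min()) * 255).astype(np.uint8)

# Naive channel replication: tile the single channel into three identical
# channels to mimic an RGB image. The paper argues this discards nothing
# but also adds nothing, motivating a depth-aware encoder instead.
rgb_like = np.repeat(gray[..., None], 3, axis=-1)

print(rgb_like.shape)          # (H, W, 3) pseudo-RGB tensor
print((rgb_like[..., 0] == rgb_like[..., 1]).all())  # channels are identical
```

Because all three channels are identical, the replicated input carries no more information than the original grayscale map, which is the limitation the depth image-text dataset and modified ViT encoder are designed to overcome.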