Unlocking Dense Metric Depth Estimation in VLMs

📅 2026-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-language models (VLMs) are constrained by purely textual supervision, limiting their ability to perceive dense 3D geometric structures. This work proposes DepthVLM, the first VLM natively capable of high-resolution metric depth estimation. By integrating a lightweight depth prediction head into a large language model backbone and employing unified vision-text joint supervision with a two-stage training strategy, DepthVLM simultaneously generates language responses and full-resolution depth maps in a single forward pass. The approach circumvents error accumulation inherent in distillation from external models and significantly outperforms current VLMs on a newly curated indoor-outdoor unified depth benchmark. Moreover, it surpasses several specialized monocular depth estimation models while enhancing complex 3D spatial reasoning capabilities.
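As a rough illustration of what attaching a depth head to a token-based backbone can look like, the sketch below turns per-patch hidden states into a dense depth map. The summary does not describe the head's internals, so everything here is an assumption made for illustration: the `depth_head` function, its parameters, the single linear projection, the softplus activation, and the nearest-neighbour upsampling are hypothetical stand-ins, not DepthVLM's actual design.

```python
import math


def softplus(x):
    # Numerically stable softplus; keeps every predicted depth strictly positive.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def depth_head(hidden_states, weights, bias, grid_h, grid_w, scale):
    """Map per-patch hidden states to a (grid_h*scale) x (grid_w*scale) depth map.

    hidden_states: list of grid_h*grid_w feature vectors, one per image patch.
    weights, bias: parameters of a hypothetical linear projection to one scalar.
    scale: integer upsampling factor from the patch grid to pixel resolution.
    """
    assert len(hidden_states) == grid_h * grid_w
    # One depth value per patch token (assumed to be in metres).
    coarse = [
        softplus(sum(w * h for w, h in zip(weights, vec)) + bias)
        for vec in hidden_states
    ]
    # Nearest-neighbour upsampling: each pixel copies the value of its patch.
    depth_map = []
    for y in range(grid_h * scale):
        row = [
            coarse[(y // scale) * grid_w + (x // scale)]
            for x in range(grid_w * scale)
        ]
        depth_map.append(row)
    return depth_map


# Toy usage: a 2x2 patch grid with 3-dimensional hidden states.
tokens = [[0.1, 0.2, 0.3], [1.0, 0.0, -1.0], [0.5, 0.5, 0.5], [2.0, 1.0, 0.0]]
depth = depth_head(tokens, weights=[1.0, 1.0, 1.0], bias=0.0,
                   grid_h=2, grid_w=2, scale=2)
```

In this toy setup the same hidden states that a language head would read are also consumed by the depth head, which is one way to picture depth and text being produced from a single forward pass. A real head would presumably use learned upsampling rather than pixel copying.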

📝 Abstract

Vision-Language Models (VLMs) excel at 2D tasks such as grounding and captioning, yet remain limited in 3D understanding. A key limitation is their text-only supervision paradigm, which under-constrains fine-grained visual perception and prevents the recovery of dense geometry. Prior methods either distill geometry from external vision models, introducing error accumulation, or enable direct prediction through inefficient per-pixel queries or coarse token-level outputs. In this paper, we propose DepthVLM, a simple yet effective framework that transforms a single VLM into a native dense geometry predictor while preserving its multimodal capabilities. By attaching a lightweight depth head to the LLM backbone and training under a unified vision-text supervision paradigm with a two-stage schedule, DepthVLM generates full-resolution depth maps alongside language outputs in a single forward pass. We further introduce a unified indoor-outdoor metric depth benchmark in a VLM-compatible format. Experiments show that DepthVLM significantly outperforms existing VLMs with higher inference efficiency, surpasses leading pure vision models, and improves complex 3D spatial reasoning, moving toward a truly unified foundation model. All code and checkpoints will be publicly released.
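The abstract does not say which metrics the benchmark reports. Two that are widely used for metric depth evaluation are absolute relative error (AbsRel) and threshold accuracy δ < 1.25, and the sketch below computes both over flattened depth values. The function name `depth_metrics` and the convention that non-positive ground truth marks an invalid pixel are assumptions for illustration, not details taken from the paper.

```python
def depth_metrics(pred, target):
    """Return (AbsRel, delta1) over pixels with valid (positive) ground truth.

    pred, target: flat sequences of depth values in the same metric unit.
    Assumption: target <= 0 marks a pixel without ground truth and is skipped.
    """
    pairs = [(p, t) for p, t in zip(pred, target) if t > 0]
    if not pairs:
        raise ValueError("no valid ground-truth pixels")
    # AbsRel: mean of |pred - target| / target.
    abs_rel = sum(abs(p - t) / t for p, t in pairs) / len(pairs)
    # delta1: fraction of pixels whose ratio max(p/t, t/p) is below 1.25.
    # Non-positive predictions are counted as misses.
    hits = sum(1 for p, t in pairs if p > 0 and max(p / t, t / p) < 1.25)
    return abs_rel, hits / len(pairs)


# Toy usage: the last pixel has no ground truth and is ignored.
abs_rel, delta1 = depth_metrics(pred=[1.0, 2.2, 3.5, 3.0],
                                target=[1.0, 2.0, 5.0, 0.0])
```

Lower AbsRel and higher δ < 1.25 accuracy indicate better predictions. Because neither metric applies any scale or shift alignment here, both penalize errors in absolute scale, which is what distinguishes metric depth evaluation from relative depth evaluation.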

Problem

Research questions and friction points this paper is trying to address.

dense depth estimation

vision-language models

3D understanding

metric depth

visual perception

Innovation

Methods, ideas, or system contributions that make the work stand out.

dense depth estimation

vision-language models

metric depth