CLIP Can Understand Depth

📅 2024-02-05
🏛️ arXiv.org
📈 Citations: 2
Influential: 0

🤖 AI Summary
Problem: CLIP’s pre-trained image-text alignment generalizes poorly to monocular depth estimation: it does not yield reliable similarities between local image patches and depth-related prompts. Method: The authors freeze the CLIP backbone and replace natural-language prompts with a small learnable embedding matrix, named mirror, that serves as a static prompt for the text encoder. A compact deconvolutional decoder is trained jointly with mirror, and CLIP’s original vision-language alignment is not fine-tuned. Contribution/Results: The model matches several earlier state-of-the-art vision-only models on NYU Depth v2 and KITTI and outperforms every prior CLIP-based depth estimation model by a large margin. Experiments on temporal depth consistency and spatial continuity indicate that CLIP’s prior knowledge is effectively refined, and an ablation on mirror shows that the model draws on the text encoder as well as the image encoder, even though it is never given a human-written prompt.

📝 Abstract
Recent studies on generalizing CLIP for monocular depth estimation reveal that CLIP pre-trained on web-crawled data is inefficient for deriving proper similarities between image patches and depth-related prompts. In this paper, we adapt CLIP to produce monocular depth estimates of meaningful quality with dense prediction, without fine-tuning its original vision-language alignment. By jointly training a compact deconvolutional decoder with a tiny learnable embedding matrix named mirror, which serves as a static prompt for its text encoder, CLIP is enabled to understand depth. With this approach, our model exhibits impressive performance matching several previous state-of-the-art vision-only models on the NYU Depth v2 and KITTI datasets, outperforming every CLIP-based depth estimation model by a large margin. Experiments on temporal depth consistency and spatial continuity demonstrate that the prior knowledge of CLIP can be effectively refined by our proposed framework. Furthermore, an ablation study on mirror shows that the resulting model estimates depth using knowledge not only from the image encoder but also from the text encoder, despite not being given any prompt written in human language. This research demonstrates that, through minimal adjustments, the prior knowledge of vision-language foundation models such as CLIP can be generalized even to domains where learning during pretraining is challenging. We hope to facilitate future work on adjusting the suboptimal prior knowledge of vision-language models using non-human-language prompts, achieving performance on par with task-specific state-of-the-art methods.

Problem

Research questions and friction points this paper is trying to address.

Adapting CLIP to monocular depth estimation without fine-tuning
Addressing CLIP's poor generalization in depth understanding tasks
Correcting CLIP's suboptimal spatial and temporal depth consistency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Feeding a small learnable embedding matrix ("mirror") to the frozen text encoder as a static prompt
Training lightweight modules on frozen CLIP backbone
Eliminating natural language prompts for depth estimation
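
To make the data flow behind these points concrete, here is a minimal NumPy sketch of the forward pass. It is not the authors' implementation. The abstract only states that CLIP stays frozen, that mirror acts as a static prompt for the text encoder, and that a compact deconvolutional decoder is trained jointly. Everything else here is an assumption made for illustration: the two encoders are replaced by fixed random linear maps, the sizes `D`, `N_TOKENS`, `H` and `W` are made up, image and text features are fused by a patch-to-token dot product, and the decoder is a single stride-2 transposed convolution followed by a softplus.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8          # shared embedding width (hypothetical; real CLIP is far wider)
N_TOKENS = 4   # number of rows in the mirror matrix (hypothetical)
H = W = 4      # patch grid produced by the image encoder (hypothetical)

# Frozen stand-ins for CLIP's encoders: fixed random linear maps, never updated.
frozen = {
    "image_proj": rng.normal(size=(3, D)),
    "text_proj": rng.normal(size=(D, D)),
}

# The only trainable parameters: the mirror matrix and the decoder kernels.
trainable = {
    "mirror": rng.normal(size=(N_TOKENS, D)) * 0.02,
    "deconv_kernels": rng.normal(size=(N_TOKENS, 2, 2)) * 0.1,
}

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def encode_image(patches):
    """Frozen image-encoder stand-in: (H, W, 3) -> unit-norm (H, W, D)."""
    return l2_normalize(patches @ frozen["image_proj"])

def encode_text(token_embeddings):
    """Frozen text-encoder stand-in: (N_TOKENS, D) -> unit-norm (N_TOKENS, D)."""
    return l2_normalize(np.tanh(token_embeddings @ frozen["text_proj"]))

def predict_depth(patches):
    img = encode_image(patches)             # (H, W, D)
    txt = encode_text(trainable["mirror"])  # mirror replaces a written prompt
    # Assumed fusion: cosine similarity between every patch and every token.
    sim = np.einsum("hwd,nd->nhw", img, txt)  # (N_TOKENS, H, W)
    # Stride-2 transposed convolution with one 2x2 kernel per input channel;
    # for a single channel this is exactly a Kronecker product.
    up = sum(np.kron(sim[c], trainable["deconv_kernels"][c])
             for c in range(N_TOKENS))      # (2H, 2W)
    return np.log1p(np.exp(up))             # softplus keeps depth positive

depth = predict_depth(rng.random((H, W, 3)))
print(depth.shape)
```

In a training loop under these assumptions, gradients would be applied only to the entries of `trainable`, while `frozen` stays untouched, which is the split the bullets above describe.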