🤖 AI Summary
Monocular depth estimation suffers from inherent scale ambiguity and susceptibility to visual disturbances. To address these challenges, this work pioneers the integration of linguistic priors from pre-trained text-to-image diffusion models (e.g., Stable Diffusion) into depth estimation, proposing an image-text jointly driven affine-invariant depth regression framework. Methodologically, we design a cross-modal attention mechanism to fuse visual and textual features and formulate an affine-invariant depth regression loss, while semantic-guided denoising adaptively enhances user-specified regions. Our approach achieves zero-shot state-of-the-art performance on NYUv2, KITTI, ETH3D, and ScanNet, accelerates training convergence, and significantly reduces the number of inference steps. Key contributions include: (1) leveraging linguistic priors to enhance monocular depth estimation; (2) introducing affine-invariant modeling for robust depth regression; (3) enabling semantic-aware denoising; and (4) realizing efficient cross-modal joint optimization.
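The cross-modal fusion described above can be illustrated with a minimal single-head cross-attention sketch, where visual tokens query text tokens (Q from the image, K/V from the text). This is an illustrative toy in NumPy, not the paper's actual architecture; all function names and shapes are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(visual, text, d_k):
    """Visual tokens attend to text tokens (illustrative, single head,
    no learned projections). visual: (Nv, d), text: (Nt, d)."""
    scores = visual @ text.T / np.sqrt(d_k)   # (Nv, Nt) image-text similarity
    weights = softmax(scores, axis=-1)        # each visual token's attention over text
    return weights @ text                     # (Nv, d) text-informed visual features
```

In the full model, learned query/key/value projections and multiple heads would replace the raw dot products, and the fused features would condition the denoising U-Net.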
📝 Abstract
Traditional monocular depth estimation suffers from inherent ambiguity and visual nuisances. We argue that a language prior can enhance monocular depth estimation by leveraging the inductive bias learned during the text-to-image pre-training of diffusion models. The ability of these models to generate images that align with text indicates that they have learned the spatial relationships, sizes, and shapes of specified objects, knowledge that can be applied to improve depth estimation. We therefore propose PriorDiffusion, which uses a pre-trained text-to-image diffusion model that takes both an image and its corresponding text description to infer affine-invariant depth through a denoising process. We also show that the language prior sharpens the model's perception of the specific image regions that users describe and care about. At the same time, the language prior acts as a constraint that accelerates convergence of both training and the inference diffusion trajectory. By training on HyperSim and Virtual KITTI, we achieve faster training convergence, fewer inference diffusion steps, and state-of-the-art zero-shot performance across NYUv2, KITTI, ETH3D, and ScanNet. Code will be released upon acceptance.
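"Affine-invariant depth" means predictions are evaluated up to an unknown global scale and shift, which sidesteps monocular scale ambiguity. A minimal sketch of how such an evaluation could look, assuming MiDaS-style least-squares alignment of scale and shift before measuring error (the function names are illustrative, not the paper's API):

```python
import numpy as np

def align_scale_shift(pred, gt):
    """Least-squares fit of scale s and shift t so that s*pred + t ≈ gt."""
    A = np.stack([pred, np.ones_like(pred)], axis=1)  # (N, 2) design matrix
    (s, t), *_ = np.linalg.lstsq(A, gt, rcond=None)
    return s * pred + t

def affine_invariant_error(pred, gt):
    """Mean absolute error after optimal affine alignment, so predictions
    that differ from ground truth only by scale and shift score zero."""
    return float(np.mean(np.abs(align_scale_shift(pred, gt) - gt)))
```

Under this metric a prediction like `0.5 * depth - 1.0` is a perfect match for `depth`, since the alignment step absorbs the scale and shift.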