🤖 AI Summary
Monocular depth estimation suffers from inherent scale ambiguity and susceptibility to visual disturbances. To address these challenges, this work pioneers the integration of linguistic priors from pre-trained text-to-image diffusion models (e.g., Stable Diffusion) into depth estimation, proposing an image-text jointly driven affine-invariant depth regression framework. Methodologically, we design a cross-modal attention mechanism to fuse visual and textual features and formulate an affine-invariant depth regression loss, while semantic-guided denoising adaptively enhances user-specified regions. Our approach achieves zero-shot state-of-the-art performance on NYUv2, KITTI, ETH3D, and ScanNet, accelerates training convergence, and significantly reduces the number of inference steps. Key contributions include: (1) leveraging linguistic priors to enhance monocular depth estimation; (2) introducing affine-invariant modeling for robust depth regression; (3) enabling semantic-aware denoising; and (4) realizing efficient cross-modal joint optimization.
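The cross-modal fusion described above can be illustrated with a minimal single-head cross-attention sketch, where visual tokens query text tokens (Q from the image, K/V from the text). This is an illustrative toy in NumPy, not the paper's actual architecture; all function names and shapes are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(visual, text, d_k):
    """Visual tokens attend to text tokens (illustrative, single head,
    no learned projections). visual: (Nv, d), text: (Nt, d)."""
    scores = visual @ text.T / np.sqrt(d_k)   # (Nv, Nt) image-text similarity
    weights = softmax(scores, axis=-1)        # each visual token's attention over text
    return weights @ text                     # (Nv, d) text-informed visual features
```

In the full model, learned query/key/value projections and multiple heads would replace the raw dot products, and the fused features would condition the denoising U-Net.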
📝 Abstract
Traditional monocular depth estimation suffers from inherent ambiguity and visual nuisances. We argue that a language prior can enhance monocular depth estimation by leveraging the inductive bias learned during the text-to-image pre-training of diffusion models. The ability of these models to generate images that align with text indicates that they have learned the spatial relationships, sizes, and shapes of specified objects, knowledge that can be applied to improve depth estimation. We therefore propose PriorDiffusion, which uses a pre-trained text-to-image diffusion model that takes both an image and its corresponding text description to infer affine-invariant depth through a denoising process. We also show that the language prior sharpens the model's perception of the specific image regions that users describe and care about. At the same time, the language prior acts as a constraint that accelerates convergence of both training and the inference diffusion trajectory. By training on HyperSim and Virtual KITTI, we achieve faster training convergence, fewer inference diffusion steps, and state-of-the-art zero-shot performance across NYUv2, KITTI, ETH3D, and ScanNet. Code will be released upon acceptance.
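"Affine-invariant depth" means predictions are evaluated up to an unknown global scale and shift, which sidesteps monocular scale ambiguity. A minimal sketch of how such an evaluation could look, assuming MiDaS-style least-squares alignment of scale and shift before measuring error (the function names are illustrative, not the paper's API):

```python
import numpy as np

def align_scale_shift(pred, gt):
    """Least-squares fit of scale s and shift t so that s*pred + t ≈ gt."""
    A = np.stack([pred, np.ones_like(pred)], axis=1)  # (N, 2) design matrix
    (s, t), *_ = np.linalg.lstsq(A, gt, rcond=None)
    return s * pred + t

def affine_invariant_error(pred, gt):
    """Mean absolute error after optimal affine alignment, so predictions
    that differ from ground truth only by scale and shift score zero."""
    return float(np.mean(np.abs(align_scale_shift(pred, gt) - gt)))
```

Under this metric a prediction like `0.5 * depth - 1.0` is a perfect match for `depth`, since the alignment step absorbs the scale and shift.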