🤖 AI Summary
This work investigates whether large-scale pre-trained vision models spontaneously develop human-like monocular depth cues (such as occlusion, texture gradient, and relative size) without explicit depth supervision.
Method: We introduce DepthCues, the first benchmark dedicated to evaluating depth cue understanding, and evaluate 20 diverse, representative pre-trained vision models on it. We also explore fine-tuning these models on DepthCues, a paradigm that requires no dense depth annotations.
Results: Human-like depth cues emerge more strongly in larger, more recent models. Moreover, fine-tuning on DepthCues alone, without dense depth supervision, improves downstream depth estimation, suggesting a link between depth cue comprehension and depth perception performance.
📝 Abstract
Large-scale pre-trained vision models are becoming increasingly prevalent, offering expressive and generalizable visual representations that benefit various downstream tasks. Recent studies on the emergent properties of these models have revealed their high-level geometric understanding, particularly in the context of depth perception. However, it remains unclear how depth perception arises in these models without explicit depth supervision provided during pre-training. To investigate this, we examine whether monocular depth cues, similar to those used by the human visual system, emerge in these models. We introduce a new benchmark, DepthCues, designed to evaluate depth cue understanding, and present findings across 20 diverse and representative pre-trained vision models. Our analysis shows that human-like depth cues emerge in more recent larger models. We also explore enhancing depth perception in large vision models by fine-tuning on DepthCues, and find that even without dense depth supervision, this improves depth estimation. To support further research, our benchmark and evaluation code will be made publicly available for studying depth perception in vision models.