DepthART: Monocular Depth Estimation as Autoregressive Refinement Task

📅 2024-09-23
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Monocular depth estimation is fundamentally constrained by the scarcity of large-scale, densely annotated depth data. Generative approaches alleviate this dependency, but diffusion models suffer from slow inference, and autoregressive models such as VAR trained with teacher forcing yield suboptimal depth predictions. To address these limitations, we propose DepthART, a training method that frames depth estimation as an iterative residual correction task: each refinement step is conditioned on the model's own previous prediction rather than on static ground-truth token maps, the objective is framed as residual minimization, and multi-modal guidance is incorporated during training. Built on a visual autoregressive transformer and trained solely on Hypersim, DepthART achieves state-of-the-art performance across multiple unseen benchmarks, outperforming both discriminative and generative baselines while enabling significantly faster inference than diffusion-based methods.

📝 Abstract
Despite the recent success of discriminative approaches in monocular depth estimation, their quality remains limited by training datasets. Generative approaches mitigate this issue by leveraging strong priors derived from training on internet-scale datasets. Recent studies have demonstrated that large text-to-image diffusion models achieve state-of-the-art results in depth estimation when fine-tuned on small depth datasets. Concurrently, autoregressive generative approaches, such as Visual AutoRegressive modeling (VAR), have shown promising results in conditioned image synthesis. Following the visual autoregressive modeling paradigm, we introduce the first autoregressive depth estimation model based on the visual autoregressive transformer. Our primary contribution is DepthART, a novel training method formulated as a Depth Autoregressive Refinement Task. Unlike the original VAR training procedure, which employs static targets, our method utilizes a dynamic target formulation that enables model self-refinement and incorporates multi-modal guidance during training. Specifically, we use model predictions as inputs instead of ground-truth token maps during training, framing the objective as residual minimization. Our experiments demonstrate that the proposed training approach significantly outperforms visual autoregressive modeling via next-scale prediction on the depth estimation task. The visual autoregressive transformer trained with our approach on Hypersim achieves superior results on a set of unseen benchmarks compared to other generative and discriminative baselines.
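The refinement idea described in the abstract can be sketched with toy numbers. The snippet below is a minimal illustration, not the paper's implementation: all names (`quantize`, `depthart_step`) and values are hypothetical, and a coarse rounding grid stands in for the VQ codebook. The key point is that the training target at each step is the residual to the ground truth, computed from the model's own accumulated prediction rather than from a fixed ground-truth token map.

```python
# Toy sketch (hypothetical names/values) of autoregressive depth refinement
# with dynamic targets: at each step the model is conditioned on its OWN
# accumulated prediction, and the target is the residual to the ground truth.

def quantize(residual, step=0.25):
    """Stand-in for VQ token lookup: snap each value to a coarse grid."""
    return [round(r / step) * step for r in residual]

def depthart_step(prediction, ground_truth):
    """One refinement step: the dynamic target is the quantized residual."""
    residual = [g - p for g, p in zip(ground_truth, prediction)]
    target = quantize(residual)                 # what the model learns to emit
    prediction = [p + t for p, t in zip(prediction, target)]  # self-refinement
    return prediction, target

# Toy "depth map" of four pixels; start from an all-zero prediction.
gt = [1.3, 0.4, 2.1, 0.9]
pred = [0.0, 0.0, 0.0, 0.0]
for _ in range(3):                              # three refinement steps
    pred, _ = depthart_step(pred, gt)

print(pred)  # the residual shrinks toward the quantization step size
```

After a few steps every pixel's error falls below the quantization step, mirroring how coarse-to-fine scales progressively reduce the residual.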
Problem

Research questions and friction points this paper is trying to address.

Improving monocular depth estimation via autoregressive refinement
Addressing suboptimal VAR training for depth prediction
Enhancing depth accuracy with dynamic self-refinement targets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Autoregressive Transformer for depth estimation
Dynamic target formulation enables self-refinement
Residual minimization reduces training-inference discrepancy
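The last bullet, the training/inference discrepancy, can be made concrete with toy numbers. This is a hedged illustration, not the paper's code: the two-pixel values below are invented. Under teacher forcing the target assumes earlier scales matched the ground-truth token maps; the dynamic target is instead computed from whatever the model actually predicted, which is what it will also see at inference time.

```python
# Toy two-pixel example (invented values) contrasting a static teacher-forced
# target with DepthART's dynamic residual target.

gt = [2.0, 1.0]                # toy ground-truth depth values
gt_prev_scale = [1.5, 1.0]     # GT reconstruction after the previous scale
model_prev_scale = [1.0, 1.25] # what the model actually produced (imperfect)

static_target = [g - p for g, p in zip(gt, gt_prev_scale)]      # teacher forcing
dynamic_target = [g - p for g, p in zip(gt, model_prev_scale)]  # DepthART-style

print(static_target)   # ignores the model's real errors
print(dynamic_target)  # corrects the prediction the model will actually see
```

The static target would leave the model's existing error uncorrected; the dynamic target folds that error into the supervision signal, so training matches the self-conditioned inference procedure.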