🤖 AI Summary
Monocular depth estimation is fundamentally constrained by the scarcity of large-scale, densely annotated depth data. Generative approaches alleviate this dependency, but diffusion models suffer from slow inference, and autoregressive models (e.g., VAR) trained with static teacher-forcing targets yield suboptimal depth estimates. To address these limitations, we propose DepthART, an autoregressive refinement framework that casts depth estimation as iterative residual correction, conditioning each refinement step on the model's own previous prediction. Our dynamic target formulation replaces static ground-truth token maps with targets computed from the model's own intermediate outputs, and incorporates multi-modal guidance during training. Built on a visual autoregressive transformer and trained via residual minimization solely on Hypersim, DepthART outperforms both discriminative and generative baselines on multiple unseen benchmarks while enabling significantly faster inference than diffusion-based methods.
📝 Abstract
Despite the recent success of discriminative approaches to monocular depth estimation, their quality remains limited by the available training datasets. Generative approaches mitigate this issue by leveraging strong priors derived from training on internet-scale datasets. Recent studies have demonstrated that large text-to-image diffusion models achieve state-of-the-art results in depth estimation when fine-tuned on small depth datasets. Concurrently, autoregressive generative approaches, such as Visual AutoRegressive modeling (VAR), have shown promising results in conditional image synthesis. Following the visual autoregressive modeling paradigm, we introduce the first autoregressive depth estimation model based on the visual autoregressive transformer. Our primary contribution is DepthART -- a novel training method formulated as the Depth Autoregressive Refinement Task. Unlike the original VAR training procedure, which employs static targets, our method uses a dynamic target formulation that enables model self-refinement and incorporates multi-modal guidance during training. Specifically, we feed model predictions as inputs instead of ground-truth token maps during training, framing the objective as residual minimization. Our experiments demonstrate that the proposed training approach significantly outperforms visual autoregressive modeling via next-scale prediction on the depth estimation task. The visual autoregressive transformer trained with our approach on Hypersim achieves superior results on a set of unseen benchmarks compared to other generative and discriminative baselines.
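To make the dynamic-target idea concrete, here is a minimal toy sketch (not the paper's actual architecture or tokenizer): at each refinement step the target is the residual between the ground-truth depth and the model's own accumulated prediction, so the supervision signal depends on the model's outputs rather than on static ground-truth token maps. The `step_fraction` "model" that closes a fixed portion of the residual is a hypothetical stand-in for the transformer's per-scale prediction.

```python
import numpy as np

def depthart_toy_refinement(depth_gt, num_steps=4, step_fraction=0.5):
    """Toy illustration of residual-minimization refinement.

    At every step the target is the residual between the ground truth
    and the current prediction (a *dynamic* target, since it changes as
    the prediction improves). A stand-in 'model' closes a fixed fraction
    of that residual. Returns the final prediction and per-step MAE.
    """
    pred = np.zeros_like(depth_gt)  # start from an empty prediction
    errors = []
    for _ in range(num_steps):
        residual = depth_gt - pred            # dynamic target: depends on pred
        pred = pred + step_fraction * residual  # model corrects part of the gap
        errors.append(float(np.abs(depth_gt - pred).mean()))
    return pred, errors

depth_gt = np.array([[1.0, 2.0], [3.0, 4.0]])
pred, errors = depthart_toy_refinement(depth_gt)
# The error shrinks monotonically as the prediction is refined.
```

In the actual method, each step also operates at a coarser-to-finer token-map scale (as in VAR) and the residual is quantized into discrete tokens; this sketch only illustrates why conditioning on the model's own prediction makes the target dynamic.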