🤖 AI Summary
This work addresses monocular depth estimation by introducing visual autoregressive (VAR) modeling to the field for the first time, proposing a geometry-aware autoregressive generation paradigm grounded in large-scale text-to-image VAR priors. Methodologically, it enables efficient ten-step autoregressive decoding via scale-wise conditional upsampling and classifier-free guidance, requiring only 74K synthetic samples for fine-tuning. In contrast to dominant diffusion-based approaches, the framework offers superior data scalability and cross-scene generalization. It establishes new state-of-the-art performance on the indoor NYUv2 benchmark (δ₁ = 0.921) while maintaining strong robustness on the outdoor KITTI dataset. These results empirically validate the effectiveness and generality of VAR modeling for monocular depth prediction.
📝 Abstract
We propose a monocular depth estimation method based on visual autoregressive (VAR) priors, offering an alternative to diffusion-based approaches. Our method adapts a large-scale text-to-image VAR model and introduces a scale-wise conditional upsampling mechanism with classifier-free guidance. Inference runs in ten fixed autoregressive stages, requires only 74K synthetic samples for fine-tuning, and achieves competitive results. We report state-of-the-art performance on indoor benchmarks under constrained training conditions, and strong performance when applied to outdoor datasets. This work establishes autoregressive priors as a complementary family of geometry-aware generative models for depth estimation, highlighting advantages in data scalability and adaptability to 3D vision tasks. Code is available at https://github.com/AmirMaEl/VAR-Depth.
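To make the decoding procedure concrete, here is a minimal sketch of scale-wise autoregressive decoding with classifier-free guidance. This is not the authors' implementation: the scale schedule, the nearest-neighbour upsampling stand-in, the `predict_logits` model interface, and the guidance weight `w` are all illustrative assumptions; only the overall pattern (coarse-to-fine stages, each conditioned on an upsampled previous scale, with guided logits) follows the description above.

```python
import numpy as np

def cfg_logits(cond, uncond, w=2.0):
    # Classifier-free guidance: extrapolate conditional logits away from
    # unconditional ones by guidance weight w.
    return uncond + w * (cond - uncond)

def nearest_upsample(x, size):
    # Stand-in for scale-wise conditional upsampling: nearest-neighbour resize
    # of the previous scale's token map to the next scale's resolution.
    rows = np.arange(size) * x.shape[0] // size
    cols = np.arange(size) * x.shape[1] // size
    return x[np.ix_(rows, cols)]

def var_depth_decode(predict_logits, scales=(1, 2, 3, 4, 5, 6, 8, 10, 13, 16), w=2.0):
    # Hypothetical ten-stage coarse-to-fine decoding loop. `predict_logits` is an
    # assumed model interface returning per-position token logits of shape
    # (s, s, vocab) given the upsampled context from the coarser scale.
    prev = np.zeros((scales[0], scales[0]), dtype=np.int64)
    for s in scales:
        ctx = nearest_upsample(prev, s)                 # condition on coarser scale
        cond = predict_logits(ctx, conditioned=True)    # image-conditioned pass
        uncond = predict_logits(ctx, conditioned=False) # unconditional pass
        prev = cfg_logits(cond, uncond, w).argmax(-1)   # greedy token per position
    return prev  # final-scale depth-token map

# Usage with a dummy model standing in for the fine-tuned VAR network:
rng = np.random.default_rng(0)
def dummy_model(ctx, conditioned):
    return rng.standard_normal((*ctx.shape, 256))

depth_tokens = var_depth_decode(dummy_model)
```

In the real method the final token map would be decoded back to a metric or affine-invariant depth map by the VAR tokenizer's decoder, which this sketch omits.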