Visual Autoregressive Modelling for Monocular Depth Estimation

📅 2025-12-27
🤖 AI Summary
This work addresses monocular depth estimation with visual autoregressive (VAR) modeling, proposing a geometry-aware autoregressive generation approach built on large-scale text-to-image VAR priors. The method uses scale-wise conditional upsampling with classifier-free guidance, runs inference in ten fixed autoregressive stages, and needs only 74K synthetic samples for fine-tuning. Positioned as an alternative to the dominant diffusion-based approaches, it highlights advantages in data scalability and adaptability to 3D vision tasks. It reports state-of-the-art results on the indoor NYUv2 benchmark (δ₁ = 0.921) under constrained training conditions and strong performance on the outdoor KITTI dataset.
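
For readers unfamiliar with the δ₁ figure quoted in the summary: in the depth estimation literature it usually denotes threshold accuracy, the fraction of valid pixels whose ratio max(pred/gt, gt/pred) falls below 1.25. The sketch below is a dependency-free illustration of that standard definition. The function name `delta1` and the toy depth values are hypothetical, and the paper's own evaluation protocol (valid-depth masks, depth caps, any alignment step) may differ.

```python
def delta1(pred, gt, threshold=1.25):
    """Fraction of valid pixels with max(pred/gt, gt/pred) < threshold.

    `pred` and `gt` are flat sequences of depths; pixels where either
    value is non-positive are treated as invalid and skipped
    (an assumption of this sketch).
    """
    hits = 0
    valid = 0
    for p, g in zip(pred, gt):
        if p <= 0 or g <= 0:
            continue  # skip pixels without usable depth
        valid += 1
        if max(p / g, g / p) < threshold:
            hits += 1
    if valid == 0:
        raise ValueError("no valid pixels to evaluate")
    return hits / valid


# Toy example: the last pixel has no ground truth and is ignored.
pred_depth = [1.0, 2.1, 3.0, 8.0]
gt_depth = [1.1, 2.0, 4.0, 0.0]
score = delta1(pred_depth, gt_depth)
```

In the toy example two of the three valid pixels fall within the 1.25 ratio, so `score` is two thirds.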

📝 Abstract
We propose a monocular depth estimation method based on visual autoregressive (VAR) priors, offering an alternative to diffusion-based approaches. Our method adapts a large-scale text-to-image VAR model and introduces a scale-wise conditional upsampling mechanism with classifier-free guidance. Our approach performs inference in ten fixed autoregressive stages, requires only 74K synthetic samples for fine-tuning, and achieves competitive results. We report state-of-the-art performance on indoor benchmarks under constrained training conditions, and strong performance when applied to outdoor datasets. This work establishes autoregressive priors as a complementary family of geometry-aware generative models for depth estimation, highlighting advantages in data scalability and adaptability to 3D vision tasks. Code available at "https://github.com/AmirMaEl/VAR-Depth".
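
Classifier-free guidance, mentioned in the abstract, is commonly implemented by running the model once with the condition and once without it, then extrapolating from the unconditional prediction toward the conditional one. The following is a minimal sketch of that common formulation applied to per-token logits. The function name `guided_logits`, the guidance scale of 1.5, and the choice to guide logits are assumptions for illustration, not details taken from the paper.

```python
def guided_logits(cond, uncond, scale):
    """Combine conditional and unconditional logits.

    Common classifier-free guidance rule:
        guided = uncond + scale * (cond - uncond)
    scale = 0 returns the unconditional logits, scale = 1 the
    conditional ones, and scale > 1 pushes further toward the condition.
    """
    if len(cond) != len(uncond):
        raise ValueError("logit vectors must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Toy logits over a three-entry codebook (hypothetical values).
cond = [2.0, 0.5, -1.0]
uncond = [1.0, 0.5, 0.0]
guided = guided_logits(cond, uncond, scale=1.5)
```

With these toy values the entries on which the two passes disagree are amplified, while the entry on which they agree is left unchanged.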
Problem

Research questions and friction points this paper is trying to address.

Monocular depth estimation using visual autoregressive priors
Alternative to diffusion-based methods with scale-wise upsampling
Competitive indoor and outdoor performance with limited data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual autoregressive priors for depth estimation
Scale-wise conditional upsampling with classifier-free guidance
Fixed ten-stage inference with minimal fine-tuning data
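
The fixed ten-stage, scale-wise inference listed above can be pictured as a coarse-to-fine loop in the style of next-scale prediction: each stage predicts a small map at its own resolution, which is upsampled and accumulated into a running full-resolution estimate. The sketch below is a toy, dependency-free illustration under stated assumptions. The scale schedule `SCALES`, the nearest-neighbour resizing, the residual accumulation, and the stand-in `toy_predictor` are all hypothetical and do not come from the paper or its code.

```python
# Hypothetical schedule of ten side lengths, coarse to fine (an assumption).
SCALES = [1, 2, 3, 4, 5, 6, 8, 10, 13, 16]


def resize_nearest(grid, size):
    """Resize a square 2D list to size x size by nearest-neighbour lookup."""
    n = len(grid)
    return [[grid[i * n // size][j * n // size] for j in range(size)]
            for i in range(size)]


def decode(predict_stage, scales=SCALES):
    """Run one prediction per scale and accumulate upsampled residuals."""
    full = scales[-1]
    estimate = [[0.0] * full for _ in range(full)]
    for stage, size in enumerate(scales):
        # Each stage is conditioned on the estimate so far, at its own scale.
        context = resize_nearest(estimate, size)
        residual = predict_stage(stage, size, context)
        upsampled = resize_nearest(residual, full)
        estimate = [[e + r for e, r in zip(e_row, r_row)]
                    for e_row, r_row in zip(estimate, upsampled)]
    return estimate


def toy_predictor(stage, size, context):
    """Stand-in for the network: a constant residual at every stage."""
    return [[0.5] * size for _ in range(size)]


depth = decode(toy_predictor)
```

In a real system `predict_stage` would be the fine-tuned network conditioned on the input image; here it only demonstrates that the loop runs a fixed number of stages regardless of the output resolution.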