🤖 AI Summary
This work addresses monocular depth estimation by introducing visual autoregressive (VAR) modeling to the field for the first time, proposing a geometry-aware autoregressive generation paradigm grounded in large-scale text-to-image VAR priors. Methodologically, it enables efficient ten-step autoregressive decoding via scale-wise conditional upsampling and classifier-free guidance, requiring only 74K synthetic samples for fine-tuning. In contrast to dominant diffusion-based approaches, the framework offers superior data scalability and cross-scene generalization. It establishes new state-of-the-art performance on the indoor NYUv2 benchmark (δ₁ = 0.921) while maintaining strong robustness on the outdoor KITTI dataset. These results empirically validate the effectiveness and generality of VAR modeling for monocular depth prediction.
📝 Abstract
We propose a monocular depth estimation method based on visual autoregressive (VAR) priors, offering an alternative to diffusion-based approaches. Our method adapts a large-scale text-to-image VAR model and introduces a scale-wise conditional upsampling mechanism with classifier-free guidance. Inference runs in ten fixed autoregressive stages, requires only 74K synthetic samples for fine-tuning, and achieves competitive results. We report state-of-the-art performance on indoor benchmarks under constrained training conditions, and strong performance when applied to outdoor datasets. This work establishes autoregressive priors as a complementary family of geometry-aware generative models for depth estimation, highlighting advantages in data scalability and adaptability to 3D vision tasks. Code is available at https://github.com/AmirMaEl/VAR-Depth.
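To make the decoding procedure concrete, here is a minimal sketch of scale-wise autoregressive decoding with classifier-free guidance. This is not the authors' implementation: the scale schedule, the nearest-neighbour upsampling stand-in, the `predict_logits` model interface, and the guidance weight `w` are all illustrative assumptions; only the overall pattern (coarse-to-fine stages, each conditioned on an upsampled previous scale, with guided logits) follows the description above.

```python
import numpy as np

def cfg_logits(cond, uncond, w=2.0):
    # Classifier-free guidance: extrapolate conditional logits away from
    # unconditional ones by guidance weight w.
    return uncond + w * (cond - uncond)

def nearest_upsample(x, size):
    # Stand-in for scale-wise conditional upsampling: nearest-neighbour resize
    # of the previous scale's token map to the next scale's resolution.
    rows = np.arange(size) * x.shape[0] // size
    cols = np.arange(size) * x.shape[1] // size
    return x[np.ix_(rows, cols)]

def var_depth_decode(predict_logits, scales=(1, 2, 3, 4, 5, 6, 8, 10, 13, 16), w=2.0):
    # Hypothetical ten-stage coarse-to-fine decoding loop. `predict_logits` is an
    # assumed model interface returning per-position token logits of shape
    # (s, s, vocab) given the upsampled context from the coarser scale.
    prev = np.zeros((scales[0], scales[0]), dtype=np.int64)
    for s in scales:
        ctx = nearest_upsample(prev, s)                 # condition on coarser scale
        cond = predict_logits(ctx, conditioned=True)    # image-conditioned pass
        uncond = predict_logits(ctx, conditioned=False) # unconditional pass
        prev = cfg_logits(cond, uncond, w).argmax(-1)   # greedy token per position
    return prev  # final-scale depth-token map

# Usage with a dummy model standing in for the fine-tuned VAR network:
rng = np.random.default_rng(0)
def dummy_model(ctx, conditioned):
    return rng.standard_normal((*ctx.shape, 256))

depth_tokens = var_depth_decode(dummy_model)
```

In the real method the final token map would be decoded back to a metric or affine-invariant depth map by the VAR tokenizer's decoder, which this sketch omits.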