🤖 AI Summary
This work addresses the slow inference and poor fine-tuning adaptability of diffusion models for monocular depth/normal estimation. We propose an efficient, single-step deterministic generation framework that enables end-to-end one-step fine-tuning (the first such capability for diffusion models), effectively collapsing diffusion models into lightweight deterministic regressors. Our approach builds on the Stable Diffusion architecture, incorporates task-specific depth/normal regression losses, and fixes a previously unnoticed flaw in the image-conditional inference pipeline that made single-step prediction appear inefficient. Experiments demonstrate over 200× faster inference than standard multi-step diffusion baselines, with zero-shot depth/normal estimation performance that surpasses all existing diffusion-based methods. Notably, our method attains state-of-the-art results on major benchmarks including NYUv2 and KITTI.
📝 Abstract
Recent work showed that large diffusion models can be reused as highly precise monocular depth estimators by casting depth estimation as an image-conditional image generation task. While the proposed model achieved state-of-the-art results, high computational demands due to multi-step inference limited its use in many scenarios. In this paper, we show that the perceived inefficiency was caused by a flaw in the inference pipeline that had so far gone unnoticed. The fixed model performs comparably to the best previously reported configuration while being more than 200× faster. To optimize for downstream task performance, we perform end-to-end fine-tuning on top of the single-step model with task-specific losses and obtain a deterministic model that outperforms all other diffusion-based depth and normal estimation models on common zero-shot benchmarks. Surprisingly, we find that this fine-tuning protocol also works directly on Stable Diffusion and achieves performance comparable to current state-of-the-art diffusion-based depth and normal estimation models, calling into question some of the conclusions drawn from prior work.
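The core idea of the abstract (collapse the multi-step sampler into a single deterministic prediction, then fine-tune that one step end-to-end with a task loss) can be caricatured with a toy numerical sketch. Everything below is a hypothetical stand-in, not the paper's code: a linear map plays the role of the denoising network, a zero latent replaces the sampled noise of the single-step model, and MSE stands in for the task-specific depth/normal loss.

```python
import numpy as np

# Toy sketch (assumptions, not the paper's implementation):
#   - "single_step" is one deterministic x0-prediction, no sampling loop;
#   - W is a linear stand-in for the denoiser; X are image latents;
#   - D are synthetic "ground-truth" depth latents; loss is plain MSE.
rng = np.random.default_rng(0)
dim = 8
W = rng.normal(size=(dim, dim)) * 0.1      # stand-in denoiser weights
X = rng.normal(size=(64, dim))             # image-conditioning latents
D = X @ rng.normal(size=(dim, dim))        # synthetic target depth latents

def single_step(W, X):
    # Deterministic single step: directly predict the clean latent from
    # the conditioning, instead of iterating a multi-step sampler.
    return X @ W

losses = []
lr = 0.05
for _ in range(300):
    pred = single_step(W, X)
    err = pred - D
    losses.append(float(np.mean(err ** 2)))  # task-specific loss (MSE here)
    W -= lr * (X.T @ err) / len(X)           # end-to-end gradient step
```

Because the whole pipeline is one differentiable step, the task loss back-propagates straight through the "denoiser", which is what makes the end-to-end fine-tuning described above possible; with an iterative stochastic sampler in the loop this would be far more expensive.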