Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think

📅 2024-09-17
🏛️ arXiv.org
📈 Citations: 14
Influential: 2
🤖 AI Summary
This work addresses the slow inference and limited fine-tuning adaptability of diffusion models for monocular depth and surface normal estimation. The authors show that the perceived inefficiency stems from a previously unnoticed flaw in the inference pipeline: once fixed, the model runs as a single-step deterministic predictor. Building on the Stable Diffusion architecture, they then perform end-to-end fine-tuning with task-specific depth/normal losses, effectively collapsing the diffusion model into a lightweight deterministic regressor. Experiments show more than 200× faster inference than the multi-step baseline, with zero-shot depth and normal estimation that surpasses all prior diffusion-based methods and reaches state-of-the-art results on major benchmarks including NYUv2 and KITTI.

📝 Abstract
Recent work showed that large diffusion models can be reused as highly precise monocular depth estimators by casting depth estimation as an image-conditional image generation task. While the proposed model achieved state-of-the-art results, high computational demands due to multi-step inference limited its use in many scenarios. In this paper, we show that the perceived inefficiency was caused by a flaw in the inference pipeline that has so far gone unnoticed. The fixed model performs comparably to the best previously reported configuration while being more than 200× faster. To optimize for downstream task performance, we perform end-to-end fine-tuning on top of the single-step model with task-specific losses and get a deterministic model that outperforms all other diffusion-based depth and normal estimation models on common zero-shot benchmarks. We surprisingly find that this fine-tuning protocol also works directly on Stable Diffusion and achieves comparable performance to current state-of-the-art diffusion-based depth and normal estimation models, calling into question some of the conclusions drawn from prior works.
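The abstract's central observation, that the multi-step diffusion pipeline hides a single-step deterministic predictor, can be illustrated with a toy sketch. Everything below (the linear `unet` stand-in, the shapes, the schematic update rule) is a hypothetical placeholder, not the paper's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the denoising UNet: a fixed linear map. In the real
# pipeline this is the Stable Diffusion UNet conditioned on the image;
# here it is a hypothetical placeholder to contrast the two inference modes.
W = rng.standard_normal((8, 16)) * 0.1

def unet(latent, t):
    """Predict the clean depth latent from a (noisy) latent; t is ignored
    in this toy, whereas the real model is timestep-conditioned."""
    return latent @ W.T

image_latent = rng.standard_normal(8)

def multi_step_inference(steps=50):
    # Standard diffusion inference: start from Gaussian noise and run
    # many denoising passes (schematic DDIM-like loop, not the real update).
    x = rng.standard_normal(8)
    for t in reversed(range(steps)):
        x0 = unet(np.concatenate([image_latent, x]), t)
        x = x0  # schematic; real DDIM re-noises toward step t-1
    return x0

def single_step_inference():
    # Fixed pipeline: one deterministic pass with the noise input replaced
    # by zeros at the terminal timestep -- the diffusion model collapses
    # into a feed-forward regressor, hence the large speedup.
    return unet(np.concatenate([image_latent, np.zeros(8)]), t=0)

pred = single_step_inference()
print(pred.shape)  # (8,)
```

Because `single_step_inference` draws no noise, repeated calls return identical outputs, which is the deterministic behavior the abstract describes.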
Problem

Research questions and friction points this paper is trying to address.

Optimizing diffusion models for faster depth estimation
Reducing computational demands in image-conditional tasks
Enhancing performance in zero-shot depth and normal estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fixed inference pipeline yielding an efficient single-step deterministic model
End-to-end fine-tuning of the single-step model with task-specific losses
Fine-tuning protocol that also works directly on Stable Diffusion
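The "task-specific losses" above are not spelled out on this page; a common choice for affine-invariant monocular depth is an MSE computed after least-squares scale-and-shift alignment. A minimal sketch under that assumption (the exact loss in the paper may differ):

```python
import numpy as np

def affine_invariant_loss(pred, target):
    """Align pred to target with a least-squares scale s and shift b,
    then return the MSE of the aligned prediction. Affine alignment makes
    the loss invariant to the unknown global scale/shift of monocular depth."""
    pred, target = pred.ravel(), target.ravel()
    # Solve min_{s,b} ||s*pred + b - target||^2 as a small least-squares system.
    A = np.stack([pred, np.ones_like(pred)], axis=1)
    (s, b), *_ = np.linalg.lstsq(A, target, rcond=None)
    return float(np.mean((s * pred + b - target) ** 2))

depth_gt = np.linspace(1.0, 10.0, 100)
# A prediction that is correct up to scale and shift incurs ~zero loss.
print(affine_invariant_loss(0.5 * depth_gt + 3.0, depth_gt))  # ≈ 0.0
```

During end-to-end fine-tuning, a differentiable version of such a loss would be backpropagated through the single-step model; this numpy version only illustrates the alignment idea.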