🤖 AI Summary
This work addresses image reconstruction from degraded observations. It proposes an adaptive diffusion framework, ADIR, that combines a conditional sampling scheme with the priors of pretrained diffusion models (Stable Diffusion and Guided Diffusion), so that reconstructions stay consistent with the observations. Rather than relying on a fixed prior, the method adapts the pretrained denoising network to each input. The preferred adaptation strategy uses "nearest neighbor" images of the degraded input, retrieved from a diverse dataset with an off-the-shelf vision-language model (VLM). Experiments show a significant improvement in super-resolution, deblurring, and text-based editing. The results support the adaptive diffusion approach as a way to balance agreement with the observations against natural image priors across several restoration tasks.
📝 Abstract
In recent years, denoising diffusion models have demonstrated outstanding image generation performance. The information on natural images captured by these models is useful for many image reconstruction applications, where the task is to restore a clean image from its degraded observations. In this work, we propose a conditional sampling scheme that exploits the prior learned by diffusion models while retaining agreement with the observations. We then combine it with a novel approach for adapting pretrained diffusion denoising networks to their input. We examine two adaptation strategies: the first uses only the degraded image, while the second, which we advocate, uses images that are "nearest neighbors" of the degraded image, retrieved from a diverse dataset using an off-the-shelf vision-language model. To evaluate our method, we test it on two state-of-the-art publicly available diffusion models, Stable Diffusion and Guided Diffusion. We show that our proposed "adaptive diffusion for image reconstruction" (ADIR) approach achieves a significant improvement in the super-resolution, deblurring, and text-based editing tasks.
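The nearest-neighbor retrieval step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the embedding model, the similarity measure, or the number of neighbors, so everything here is an assumption: the embeddings are taken to come from some vision-language image encoder, cosine similarity is used as the ranking score, and the function names (`cosine_similarity`, `retrieve_nearest_neighbors`) and toy vectors are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_nearest_neighbors(
    query: Sequence[float],
    gallery: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (index, similarity) for the k gallery embeddings most similar to the query."""
    scored = [(i, cosine_similarity(query, emb)) for i, emb in enumerate(gallery)]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stand-ins for embeddings; in practice these would be produced by an
# image encoder of a vision-language model (an assumption, not shown here).
gallery_embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_embedding = [1.0, 0.05, 0.0]  # embedding of the degraded image (assumed)

neighbors = retrieve_nearest_neighbors(query_embedding, gallery_embeddings, k=2)
neighbor_indices = [index for index, _ in neighbors]
```

In this sketch the retrieved indices would select the images used to adapt the pretrained denoising network to the degraded input; the adaptation itself is not shown, since the abstract gives no detail on how it is carried out.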