🤖 AI Summary
Autofocus systems frequently fail to capture the intended subject, and existing methods struggle to achieve photorealistic, controllable post-capture refocusing from a single defocused image. To address this, we propose the first end-to-end video-diffusion-based framework for focal stack synthesis, enabling interactive refocusing and editing after capture. Our key contributions are: (1) the first application of video diffusion models to perceptually realistic focal stack generation; (2) the first large-scale focal stack dataset captured with smartphones under real-world conditions; and (3) a training paradigm that integrates defocus modeling, real-domain image synthesis, and multi-scale spatiotemporal denoising. Extensive experiments show that our method significantly outperforms state-of-the-art approaches in perceptual quality, robustness to diverse defocus patterns, and generalization to complex scenes. Both code and dataset are publicly released.
📝 Abstract
Focus is a cornerstone of photography, yet autofocus systems often fail to capture the intended subject, and users frequently wish to adjust focus after capture. We introduce a novel method for realistic post-capture refocusing using video diffusion models. From a single defocused image, our approach generates a perceptually accurate focal stack, represented as a video sequence, enabling interactive refocusing and unlocking a range of downstream applications. To support this work and future research, we release a large-scale focal stack dataset acquired under diverse real-world smartphone conditions. Our method consistently outperforms existing approaches in both perceptual quality and robustness across challenging scenarios, paving the way for more advanced focus-editing capabilities in everyday photography. Code and data are available at www.learn2refocus.github.io.
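The abstract does not detail the refocusing interface, but the value of a focal-stack-as-video output is easy to illustrate: tap-to-refocus reduces to selecting the frame that is sharpest around the tapped pixel. The sketch below is a minimal, hypothetical illustration of that idea, not the paper's implementation; the `refocus` helper, the Laplacian-based sharpness measure, the patch size, and the grayscale input are all assumptions.

```python
import numpy as np
from scipy.ndimage import laplace, uniform_filter

def refocus(stack: np.ndarray, x: int, y: int, patch: int = 15) -> np.ndarray:
    """Return the focal-stack frame that is sharpest around pixel (x, y).

    stack: (T, H, W) grayscale focal stack, frames ordered by focal distance.
    patch: side length of the local window used to pool sharpness.
    """
    # Per-frame sharpness map: local mean of the squared Laplacian response.
    sharpness = np.stack([
        uniform_filter(laplace(frame.astype(np.float32)) ** 2, size=patch)
        for frame in stack
    ])
    # Pick the frame whose focal plane gives peak sharpness at the tap point.
    best = int(np.argmax(sharpness[:, y, x]))
    return stack[best]

# Usage with synthetic data (hypothetical 10-frame, 480x640 stack):
stack = np.random.rand(10, 480, 640).astype(np.float32)
frame = refocus(stack, x=320, y=240)
```

The same frame-selection logic extends naturally to editing applications such as all-in-focus compositing, where each pixel takes its value from its sharpest frame rather than a single tapped one.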