🤖 AI Summary
Existing text-to-image diffusion models struggle to achieve fine-grained, physically grounded depth-of-field control, such as adjustable aperture and focus distance, without altering scene content. This work proposes a physics-inspired framework: it first generates an all-in-focus image, then estimates monocular depth and predicts a plausible focus distance, and finally synthesizes photorealistic, depth-consistent defocus with a differentiable lens blur model. Its core component is the Focus Distance Transformer, which enables interactive, inference-time adjustment of both blur intensity and focal-plane position. Because the pipeline is fully differentiable, it trains end to end from EXIF camera metadata alone, without annotated focus-distance supervision. Experiments demonstrate that the method significantly outperforms prior approaches across diverse scenes, achieving high-fidelity, content-preserving, and fine-grained controllable defocus synthesis.
📝 Abstract
Current text-to-image diffusion models excel at generating diverse, high-quality images, yet they struggle to incorporate fine-grained camera metadata such as precise aperture settings. In this work, we introduce a novel text-to-image diffusion framework that leverages camera metadata (EXIF data), which is often embedded in image files, with an emphasis on generating controllable lens blur. Our method mimics the physical image formation process by first generating an all-in-focus image, estimating its monocular depth, predicting a plausible focus distance with a novel focus distance transformer, and then forming a defocused image with an existing differentiable lens blur model. Gradients flow backward through this whole process, allowing us to learn, without explicit supervision, to generate defocus effects based on content elements and the provided EXIF data. At inference time, this enables precise interactive user control over defocus effects while preserving scene contents, which is not achievable with existing diffusion models. Experimental results demonstrate that our model enables superior fine-grained control without altering the depicted scene.
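The final stage of the pipeline, rendering defocus from an all-in-focus image, a depth map, a focus distance, and an aperture, can be grounded in the classical thin-lens model: each pixel's blur diameter (circle of confusion) grows with its depth offset from the focal plane and with the aperture size. The sketch below is only an illustration of that physics, not the paper's actual lens blur layer; the function names, the 50 mm focal length, and the naive box-average blur are all assumptions (a real differentiable renderer would composite soft depth layers instead).

```python
import numpy as np

def circle_of_confusion(depth, focus_dist, focal_len=0.05, f_number=2.8):
    """Per-pixel circle-of-confusion diameter (m) from the thin-lens model.

    depth:      array of scene depths (m)
    focus_dist: distance to the focal plane (m)
    focal_len:  lens focal length (m); 50 mm assumed here
    f_number:   aperture f-number; smaller -> wider aperture -> stronger blur
    """
    aperture = focal_len / f_number  # aperture diameter (m)
    return (aperture * focal_len / (focus_dist - focal_len)
            * np.abs(depth - focus_dist) / depth)

def defocus_blur(image, depth, focus_dist, max_radius=8, px_per_m=2e4):
    """Naive depth-dependent blur: average each pixel over a square window
    sized by its circle of confusion (converted to pixels and clipped).
    Illustrative only; not differentiable and not occlusion-aware."""
    coc_px = np.clip(circle_of_confusion(depth, focus_dist) * px_per_m,
                     0, max_radius)
    h, w = image.shape[:2]
    out = np.empty_like(image, dtype=float)
    for y in range(h):
        for x in range(w):
            r = int(round(coc_px[y, x]))
            y0, y1 = max(0, y - r), min(h, y + r + 1)
            x0, x1 = max(0, x - r), min(w, x + r + 1)
            out[y, x] = image[y0:y1, x0:x1].mean(axis=(0, 1))
    return out
```

Two properties of this model matter for the paper's setup: pixels exactly at the focus distance have zero blur (scene content at the focal plane is preserved), and the blur diameter scales with aperture, which is exactly the EXIF quantity the diffusion model is conditioned on.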