🤖 AI Summary
This work addresses single-image light field (LF) synthesis without multi-view inputs or specialized hardware. We propose an inverse image rendering framework that inverts a single RGB image into a set of source rays emitted from pixel locations, models ray propagation via a neural rendering pipeline, and explicitly captures geometric and semantic correlations among rays using cross-attention. An iterative ray expansion strategy jointly refines the source ray set while enforcing occlusion consistency. Crucially, our method avoids explicit 3D reconstruction and requires no scene-specific priors or fine-tuning, enabling strong cross-domain generalization. Evaluated on multiple real-world LF datasets, it significantly outperforms state-of-the-art approaches, supporting high-fidelity novel view synthesis, digital refocusing, and shallow depth-of-field effects. To our knowledge, this is the first end-to-end, generalizable solution for single-image LF generation.
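The cross-attention step described above — a target ray querying the stored source rays and aggregating their colors by relevance — can be sketched as a toy NumPy example. The random projection matrices stand in for learned weights, and the 6-D ray parameterization and dimension `d=16` are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def render_target_ray(target_ray, source_rays, source_colors, d=16, rng=None):
    """Toy cross-attention renderer: the target ray (query) attends over
    the stored source rays (keys) and aggregates their colors (values)."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Hypothetical "learned" projections, here random for illustration.
    Wq = rng.standard_normal((target_ray.shape[-1], d))
    Wk = rng.standard_normal((source_rays.shape[-1], d))
    q = target_ray @ Wq                     # (d,)   query embedding
    k = source_rays @ Wk                    # (N, d) key embeddings
    attn = softmax(k @ q / np.sqrt(d))      # (N,)   relevance of each source ray
    return attn @ source_colors             # (3,)   predicted RGB as convex mix

# One target ray attends over 64 source rays parameterized as (origin, direction).
rng = np.random.default_rng(1)
src = rng.standard_normal((64, 6))
col = rng.random((64, 3))
tgt = rng.standard_normal(6)
rgb = render_target_ray(tgt, src, col, rng=rng)
print(rgb.shape)  # (3,)
```

Because the attention weights are non-negative and sum to one, the predicted color is a convex combination of the source-ray colors, which is what lets the renderer interpolate plausibly between observed rays.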
📝 Abstract
The concept of light fields, computed from multi-view images captured on regular grids, has proven beneficial for scene representation, supporting realistic rendering of novel views and photographic effects such as refocusing and shallow depth of field. Despite its effectiveness for modeling light flow, obtaining a light field requires either high computational cost or specialized hardware such as a bulky camera rig or a microlens array. To broaden its benefit and applicability, in this paper we propose a novel view synthesis method, named inverse image-based rendering, that generates a light field from only a single image. Unlike previous attempts that implicitly rebuild 3D geometry or explicitly represent the observed scene, our method reconstructs the light flow of a space from image pixels, operating in the opposite direction to image-based rendering. To accomplish this, we design a neural rendering pipeline that renders a target ray at an arbitrary viewpoint. Our neural renderer first stores the light flow of source rays from the input image, then computes the relationships among them through cross-attention, and finally predicts the color of the target ray based on these relationships. After the rendering pipeline generates the first novel view from the single input image, the newly revealed out-of-view contents are added to the set of source rays. This procedure is performed iteratively while ensuring consistent generation of occluded contents. We demonstrate that our inverse image-based rendering generalizes to various challenging datasets without any retraining or fine-tuning after being trained once on a synthetic dataset, and outperforms relevant state-of-the-art novel view synthesis methods.
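The iterative procedure at the end of the abstract — render a view, fold newly revealed out-of-view content back into the source ray set, repeat — can be sketched with a toy loop. Rays are plain labels here and `toy_render` is a hypothetical stand-in for the neural renderer; only the control flow reflects the described procedure:

```python
def synthesize_views(source_rays, viewpoints, render_view):
    """Iterative ray expansion: after each rendered view, rays not already
    in the source set (newly revealed content) are appended, so later
    views stay consistent with earlier disocclusions."""
    views = []
    for vp in viewpoints:
        view = render_view(source_rays, vp)       # render novel view from current rays
        views.append(view)
        new = [r for r in view if r not in source_rays]
        source_rays = source_rays + new           # expand the source ray set
    return views, source_rays

# Toy renderer: each viewpoint reuses known rays and reveals one occluded ray.
def toy_render(rays, vp):
    return rays[:2] + [f"occluded@{vp}"]

views, final_rays = synthesize_views(["r0", "r1"], [1, 2], toy_render)
print(len(final_rays))  # 4
```

The key design point is that the source set only grows: content disoccluded in an earlier view constrains every subsequent view, which is how the procedure enforces consistent generation of occluded regions.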