AI Summary
This work addresses the challenge of simultaneously achieving high-quality novel view synthesis and relighting from a single image while generating temporally consistent video. To this end, it introduces the first unified video diffusion model to jointly control viewpoint and illumination within a single generative framework, conditioning explicitly on a user-specified camera trajectory and environment map. The method concurrently outputs relit novel-view frames and their albedo counterparts, streamlining the pipeline while preserving high fidelity. Experiments show that the proposed approach matches or exceeds the visual quality of current state-of-the-art methods while ensuring both temporal consistency and spatial alignment across the generated video sequences.
Abstract
We present CamLit, the first unified video diffusion model that jointly performs novel view synthesis (NVS) and relighting from a single input image. Given one reference image, a user-defined camera trajectory, and an environment map, CamLit synthesizes a video of the scene from the new viewpoints under the specified illumination. Within a single generative process, our model produces temporally coherent and spatially aligned outputs, including relit novel-view frames and corresponding albedo frames, enabling precise control of both camera pose and lighting. Qualitative and quantitative experiments demonstrate that CamLit produces high-fidelity results on par with state-of-the-art methods in both novel view synthesis and relighting, sacrificing visual quality in neither task. We show that a single generative model can effectively integrate camera and lighting control, simplifying the video generation pipeline while maintaining competitive performance and consistent realism.
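To make the interface described above concrete, the following is a minimal Python sketch of how such a model might be invoked. It is an illustrative assumption, not the authors' released code: every name in it (`CameraPose`, `CamLitOutput`, `run_camlit`, the stub pipeline) is hypothetical. Only the inputs and outputs are taken from the abstract: one reference image, a user-defined camera trajectory, and an environment map go in; a relit novel-view video and a spatially aligned albedo video come out of a single generative pass.

```python
# Illustrative sketch only -- CamLit's real API is not published in this abstract.
# Inputs/outputs follow the text: one reference image, a user-defined camera
# trajectory, and an environment map go in; a relit novel-view video and a
# spatially aligned albedo video come out of a single generative pass.
from dataclasses import dataclass

import numpy as np


@dataclass
class CameraPose:
    """One camera pose along the user-defined trajectory (hypothetical format)."""
    rotation: np.ndarray     # (3, 3) world-to-camera rotation
    translation: np.ndarray  # (3,) camera translation


@dataclass
class CamLitOutput:
    """The paired outputs the abstract says are produced jointly."""
    relit_frames: np.ndarray   # (T, H, W, 3) novel views under the target lighting
    albedo_frames: np.ndarray  # (T, H, W, 3) albedo counterparts, frame-aligned


def run_camlit(pipeline, reference_image, trajectory, env_map) -> CamLitOutput:
    """Condition a (hypothetical) CamLit pipeline on a camera path + illumination."""
    return pipeline(
        image=reference_image,         # single input view, (H, W, 3)
        camera_trajectory=trajectory,  # one CameraPose per output frame
        environment_map=env_map,       # target illumination, e.g. an HDR panorama
    )


if __name__ == "__main__":
    # Stand-in pipeline so the sketch runs end to end; a real model would be a
    # pretrained video diffusion network, not this zero-filled stub.
    def dummy_pipeline(image, camera_trajectory, environment_map):
        t, (h, w, _) = len(camera_trajectory), image.shape
        return CamLitOutput(
            relit_frames=np.zeros((t, h, w, 3)),
            albedo_frames=np.zeros((t, h, w, 3)),
        )

    ref = np.zeros((256, 256, 3))
    path = [CameraPose(np.eye(3), np.array([0.0, 0.0, 0.1 * i])) for i in range(16)]
    env = np.zeros((128, 256, 3))  # equirectangular environment map
    out = run_camlit(dummy_pipeline, ref, path, env)
    print(out.relit_frames.shape, out.albedo_frames.shape)  # (16, 256, 256, 3) each
```

The point the abstract emphasizes, reflected in the single `run_camlit` call returning both streams, is that the relit frames and the albedo frames come from one pass of one model, rather than from chaining a separate NVS model and relighting model.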