AI Summary
This work addresses two challenges in single-view novel view synthesis: generating plausible content in unobserved regions, and maintaining geometric and appearance consistency across views. It reframes the task as orbital video generation and draws on the visual priors of a pretrained video generation model to produce more coherent results. For precise viewpoint control, the method adds camera adapter modules to the video model. A normal map branch with normal-guided attention improves geometric consistency, and pixel-level supervision reduces the blur introduced by spatial compression in the latent space. On the GSO and OmniObject3D benchmarks, the approach significantly outperforms existing methods, with PSNR gains of 2.9 dB and 2.4 dB, respectively, in the single-view setting.
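To make the "orbital video" framing concrete, the sketch below places cameras on a circular orbit around an object at the origin, one camera per video frame. It is only an illustration under stated assumptions: the function name `orbit_camera_positions`, the fixed radius and elevation, and the evenly spaced azimuths are hypothetical choices, not the sampling scheme used by the method.

```python
import math


def orbit_camera_positions(num_views, radius=2.0, elevation_deg=15.0):
    """Return camera centers evenly spaced on a circular orbit.

    Assumptions (hypothetical, for illustration only): the object sits at
    the origin, z is the up axis, and every camera shares the same radius
    and elevation while the azimuth sweeps a full turn.
    """
    if num_views <= 0:
        raise ValueError("num_views must be positive")
    elev = math.radians(elevation_deg)
    positions = []
    for i in range(num_views):
        # Azimuth of the i-th frame; the orbit closes after num_views frames.
        azim = 2.0 * math.pi * i / num_views
        x = radius * math.cos(elev) * math.cos(azim)
        y = radius * math.cos(elev) * math.sin(azim)
        z = radius * math.sin(elev)
        positions.append((x, y, z))
    return positions


if __name__ == "__main__":
    for frame, center in enumerate(orbit_camera_positions(8)):
        print(frame, tuple(round(c, 3) for c in center))
```

Each consecutive pair of positions differs by a small, constant azimuth step, which is what lets a sequence of target views be treated as the frames of a smooth video.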
Abstract
Novel View Synthesis (NVS) aims to generate unseen views of a 3D object given a limited number of known views. Existing methods often struggle to synthesize plausible views for unobserved regions, particularly under single-view input, and still face challenges in maintaining geometric and appearance consistency. To address these issues, we propose OrbitNVS, which reformulates NVS as an orbit video generation task. Through tailored model design and training strategies, we adapt a pre-trained video generation model to the NVS task, leveraging its rich visual priors to achieve high-quality view synthesis. Specifically, we incorporate camera adapters into the video model to enable accurate camera control. To enhance two key properties of 3D objects, geometry and appearance, we design a normal map generation branch and use normal map features to guide the synthesis of the target views via an attention mechanism, thereby improving geometric consistency. Moreover, we apply pixel-space supervision to alleviate the blurry appearance caused by spatial compression in the latent space. Extensive experiments show that OrbitNVS significantly outperforms previous methods on the GSO and OmniObject3D benchmarks, especially in the challenging single-view setting (e.g., +2.9 dB and +2.4 dB PSNR).
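For readers less familiar with the metric, the following minimal sketch shows how PSNR is conventionally computed and what a gain of a few dB implies for the underlying mean squared error. It uses the standard definition, PSNR = 10 log10(MAX^2 / MSE); the helper names `psnr` and `mse_ratio_for_gain` and the flat-list image representation are illustrative assumptions, not the evaluation code behind the reported numbers.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """PSNR in dB between two equal-length flat sequences of pixel values.

    max_value is the largest possible pixel value (1.0 for images scaled
    to [0, 1], 255 for 8-bit images).
    """
    if len(reference) != len(rendered) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - s) ** 2 for r, s in zip(reference, rendered)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks when PSNR rises by gain_db."""
    return 10.0 ** (-gain_db / 10.0)


if __name__ == "__main__":
    for gain in (2.9, 2.4):
        print(f"+{gain} dB PSNR -> MSE multiplied by {mse_ratio_for_gain(gain):.3f}")
```

Because PSNR is logarithmic in the error, a gain of roughly 3 dB corresponds to about half the mean squared error at a fixed peak value.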