Look Beyond: Two-Stage Scene View Generation via Panorama and Video Diffusion

📅 2025-08-31
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Single-image novel view synthesis (NVS) is highly ill-posed under large viewpoint offsets because of severe occluded and unobserved regions, making it difficult for existing methods to ensure viewpoint consistency and temporal coherence along long or closed-loop camera trajectories. To address this, we propose a two-stage framework. First, a panoramic diffusion model generates a 360° scene layout prior from a single input image, introducing a panoramic representation into NVS as a joint geometric and semantic constraint. Second, leveraging this prior, we synthesize temporally coherent view sequences along arbitrary user-defined trajectories via a pre-trained video diffusion model augmented with spatially modulated noise. Our method significantly improves loop-closure consistency and cross-view geometric alignment, outperforming state-of-the-art approaches on multiple scene datasets, and it enables flexible camera motion control and robust generation of globally consistent, high-fidelity dynamic views.

๐Ÿ“ Abstract
Novel view synthesis (NVS) from a single image is highly ill-posed due to large unobserved regions, especially for views that deviate significantly from the input. While existing methods focus on consistency between the source and generated views, they often fail to maintain coherence and correct view alignment across long-range or looped trajectories. We propose a model that addresses this by decomposing single-view NVS into a 360-degree scene extrapolation followed by novel view interpolation. This design ensures long-term view and scene consistency by conditioning on keyframes extracted and warped from a generated panoramic representation. In the first stage, a panorama diffusion model learns the scene prior from the input perspective image. Perspective keyframes are then sampled and warped from the panorama and used as anchor frames in a pre-trained video diffusion model, which generates novel views through a proposed spatial noise diffusion process. Compared to prior work, our method produces globally consistent novel views, even in loop-closure scenarios, while enabling flexible camera control. Experiments on diverse scene datasets demonstrate that our approach outperforms existing methods in generating coherent views along user-defined trajectories. Our implementation is available at https://github.com/YiGuYT/LookBeyond.
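The second stage conditions on perspective keyframes warped from the generated panorama. A minimal sketch of that warp, assuming an equirectangular panorama and a pinhole camera model (function name, nearest-neighbour sampling, and axis conventions are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def panorama_to_perspective(pano, yaw, pitch, fov_deg, out_h, out_w):
    """Sample a pinhole-camera view from an equirectangular panorama.

    pano: (H, W, C) equirectangular image; yaw/pitch in radians.
    Nearest-neighbour sampling keeps the sketch dependency-free.
    """
    H, W = pano.shape[:2]
    # Focal length in pixels from the horizontal field of view.
    f = 0.5 * out_w / np.tan(0.5 * np.radians(fov_deg))
    # Pixel grid in camera coordinates (x right, y down, z forward).
    xs = np.arange(out_w) - 0.5 * (out_w - 1)
    ys = np.arange(out_h) - 0.5 * (out_h - 1)
    x, y = np.meshgrid(xs, ys)
    dirs = np.stack([x, y, np.full_like(x, f)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    # Rotate rays by pitch (about x) then yaw (about y).
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    dirs = dirs @ (Ry @ Rx).T
    # Convert world rays to equirectangular (lon, lat) coordinates.
    lon = np.arctan2(dirs[..., 0], dirs[..., 2])   # [-pi, pi]
    lat = np.arcsin(np.clip(dirs[..., 1], -1, 1))  # [-pi/2, pi/2]
    u = ((lon / (2 * np.pi) + 0.5) * W).astype(int) % W
    v = np.clip(((lat / np.pi + 0.5) * H).astype(int), 0, H - 1)
    return pano[v, u]
```

Warping several such views at evenly spaced yaws yields the anchor keyframes the video model is conditioned on.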
Problem

Research questions and friction points this paper is trying to address.

Generating globally consistent novel views from a single image
Maintaining coherence across long-range or looped trajectories
Enabling flexible camera control via panorama and video diffusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage panorama and video diffusion
Panorama diffusion for scene extrapolation
Keyframe warping for view interpolation
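The anchoring idea above can be sketched with a hypothetical scheduling helper: keyframes are warped from the panorama at evenly spaced yaws around a closed 360° loop, and the video diffusion model fills in the frames between consecutive anchors (the function names and the `frames_between` parameter are illustrative assumptions, not from the paper):

```python
import numpy as np

def anchor_yaws(n_keyframes=8):
    """Evenly spaced yaw angles covering a full 360-degree loop.

    Because the last anchor wraps back to the first, conditioning on
    these keyframes keeps the start and end of the trajectory consistent.
    """
    return np.linspace(0.0, 2 * np.pi, n_keyframes, endpoint=False)

def frame_schedule(n_keyframes=8, frames_between=5):
    """Interleave warped anchor frames with model-generated frames.

    Returns one ('anchor', k) or ('generate', k) tag per output frame,
    where k indexes the preceding anchor keyframe.
    """
    schedule = []
    for k in range(n_keyframes):
        schedule.append(('anchor', k))
        schedule.extend(('generate', k) for _ in range(frames_between))
    return schedule
```

The fixed anchors act as hard constraints, so drift can only accumulate within each short generated segment rather than over the whole trajectory.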