Generative View Stitching

📅 2025-10-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In camera-guided video generation, autoregressive diffusion models suffer from scene collisions and generation collapse because they cannot anticipate future viewpoints. To address this, the paper proposes Generative View Stitching (GVS), a parallel generation framework with Omni Guidance. Compatible with any off-the-shelf video diffusion model trained with Diffusion Forcing, GVS combines bidirectional temporal conditioning with a diffusion-based stitching algorithm to produce spatiotemporally consistent video along a predefined camera trajectory. Unlike frame-wise autoregressive approaches, GVS samples the entire sequence in parallel, enabling long-range coherent stitching without sequential frame dependency and plug-and-play use with existing models. A loop-closing mechanism further ensures trajectory continuity. Experiments show that GVS stably generates high-quality, collision-free videos with smooth frame-to-frame transitions, even on complex closed-loop paths such as Oscar Reutersvärd's Impossible Staircase, substantially improving generation robustness and geometric consistency.

📝 Abstract
Autoregressive video diffusion models are capable of long rollouts that are stable and consistent with history, but they are unable to guide the current generation with conditioning from the future. In camera-guided video generation with a predefined camera trajectory, this limitation leads to collisions with the generated scene, after which autoregression quickly collapses. To address this, we propose Generative View Stitching (GVS), which samples the entire sequence in parallel such that the generated scene is faithful to every part of the predefined camera trajectory. Our main contribution is a sampling algorithm that extends prior work on diffusion stitching for robot planning to video generation. While such stitching methods usually require a specially trained model, GVS is compatible with any off-the-shelf video model trained with Diffusion Forcing, a prevalent sequence diffusion framework that we show already provides the affordances necessary for stitching. We then introduce Omni Guidance, a technique that enhances the temporal consistency in stitching by conditioning on both the past and future, and that enables our proposed loop-closing mechanism for delivering long-range coherence. Overall, GVS achieves camera-guided video generation that is stable, collision-free, frame-to-frame consistent, and closes loops for a variety of predefined camera paths, including Oscar Reutersvärd's Impossible Staircase. Results are best viewed as videos at https://andrewsonga.github.io/gvs.
Problem

Research questions and friction points this paper is trying to address.

Autoregressive video models cannot incorporate future camera guidance
Camera-guided generation causes scene collisions and model collapse
Need for parallel sampling that keeps generation faithful to a predefined camera trajectory
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parallel sampling algorithm for video generation
Omni Guidance for temporal consistency enhancement
Loop-closing mechanism for long-range coherence
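The stitching idea behind these contributions can be illustrated with a minimal, hypothetical sketch: overlapping windows of a long sequence are denoised in parallel, predictions for shared frames are averaged at each step (a stand-in for the bidirectional past-and-future conditioning of Omni Guidance), and a wrap-around window ties the last frames back to the first, closing the loop. All names here, including the toy denoiser, are illustrative assumptions, not the paper's actual code.

```python
# Hypothetical sketch of parallel diffusion stitching with loop closure.
# Frames are modeled as plain floats standing in for noisy latents.

def stitch_denoise(frames, window, overlap, steps, denoise_step):
    """Denoise a long sequence by reconciling overlapping windows.

    frames:       list of floats (stand-ins for noisy latent frames)
    window:       frames per window
    overlap:      frames shared between consecutive windows
    denoise_step: per-window single-step denoiser (assumed given)
    """
    n = len(frames)
    stride = window - overlap
    starts = list(range(0, n, stride))  # last window wraps around
    x = list(frames)
    for _ in range(steps):
        acc = [0.0] * n
        cnt = [0] * n
        # All windows are denoised "in parallel" within one step.
        for s in starts:
            idx = [(s + i) % n for i in range(window)]  # wrap => loop closure
            out = denoise_step([x[i] for i in idx])
            for i, v in zip(idx, out):
                acc[i] += v
                cnt[i] += 1
        # Average overlapping predictions: the "stitch".
        x = [a / c for a, c in zip(acc, cnt)]
    return x

def toy_step(chunk):
    # Placeholder denoiser: pull each frame toward its window mean.
    m = sum(chunk) / len(chunk)
    return [0.5 * (v + m) for v in chunk]

result = stitch_denoise([float(i % 4) for i in range(12)],
                        window=4, overlap=2, steps=3, denoise_step=toy_step)
```

Because every frame belongs to at least two windows, each denoising step sees context from both its past and its future neighbors, which is the intuition the paper's Omni Guidance formalizes.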
Chonghyuk Song
MIT CSAIL
Michal Stary
MIT CSAIL
Boyuan Chen
MIT CSAIL
George Kopanas
RunwayML, Member of Technical Staff - Team Lead
Gaussian Splatting, NeRF, Neural Rendering, View Synthesis, Image-Based Rendering
Vincent Sitzmann
MIT CSAIL