StarGen: A Spatiotemporal Autoregression Framework with Video Diffusion Model for Scalable and Controllable Scene Generation

📅 2025-01-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing large reconstruction and generative models are computationally constrained, limiting their ability to generate large-scale 3D scenes with long-range spatial consistency and precise camera-pose control. To address this, we propose the first spatiotemporal autoregressive framework based on video diffusion models, introducing a novel cross-spatial-temporal conditioning mechanism grounded in 3D warping-based alignment. Our method integrates pretrained video diffusion priors, differentiable 3D image warping, temporally overlapping-frame conditioning, and flexible input interfaces, enabling sparse-view interpolation, perpetual view generation, and layout-conditioned city-scale scene generation. Quantitatively, it surpasses state-of-the-art methods across multiple metrics: it scales generation to hundred-meter extents, improves visual fidelity (FID reduced by 21%), and improves camera-pose accuracy (rotation error reduced by 38%). Crucially, it overcomes the inherent limitation of prior diffusion-based approaches, which are confined to local regions per inference, enabling globally coherent scene synthesis.
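To make the spatial conditioning concrete, here is a minimal sketch (not the authors' code) of the kind of differentiable 3D image warping the summary refers to: a previously generated view is back-projected with its depth map, transformed by the relative camera pose, and re-projected into the target view to serve as a spatial condition. The tensor names (`src_img`, `src_depth`, `K`, `R`, `t`) are illustrative assumptions.

```python
import torch

def warp_to_target(src_img, src_depth, K, R, t):
    """Warp src_img (3, H, W) into the target view given intrinsics K and relative pose (R, t).
    A simplified forward-splatting sketch; the paper's warping module may differ."""
    _, H, W = src_img.shape
    # Pixel grid of the source view in homogeneous coordinates.
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)]).float().reshape(3, -1)
    # Back-project to 3D points in the source camera frame using the depth map.
    pts_src = torch.linalg.inv(K) @ pix * src_depth.reshape(1, -1)
    # Transform into the target camera frame and re-project.
    pts_tgt = R @ pts_src + t.reshape(3, 1)
    proj = K @ pts_tgt
    uv = proj[:2] / proj[2:].clamp(min=1e-6)
    # Splat source colors onto the nearest target pixel (z-buffering omitted for brevity).
    u_t, v_t = uv[0].round().long(), uv[1].round().long()
    valid = (u_t >= 0) & (u_t < W) & (v_t >= 0) & (v_t < H) & (proj[2] > 0)
    warped = torch.zeros_like(src_img)
    warped[:, v_t[valid], u_t[valid]] = src_img.reshape(3, -1)[:, valid]
    return warped
```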

📝 Abstract
Recent advances in large reconstruction and generative models have significantly improved scene reconstruction and novel view generation. However, due to compute limitations, each inference with these large models is confined to a small area, making long-range consistent scene generation challenging. To address this, we propose StarGen, a novel framework that employs a pre-trained video diffusion model in an autoregressive manner for long-range scene generation. The generation of each video clip is conditioned on the 3D warping of spatially adjacent images and the temporally overlapping image from previously generated clips, improving spatiotemporal consistency in long-range scene generation with precise pose control. The spatiotemporal condition is compatible with various input conditions, facilitating diverse tasks, including sparse view interpolation, perpetual view generation, and layout-conditioned city generation. Quantitative and qualitative evaluations demonstrate StarGen's superior scalability, fidelity, and pose accuracy compared to state-of-the-art methods.
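The following is a minimal sketch, under assumed interfaces, of the autoregressive loop described in the abstract: each clip is generated by the video diffusion model conditioned on the frame that temporally overlaps the previous clip and on warps of spatially adjacent, previously generated imagery into the new camera poses. `video_diffusion.sample`, `warp_spatial_condition`, and `clip_len` are hypothetical names, not StarGen's actual API.

```python
def generate_long_range_scene(first_clip, camera_poses, video_diffusion,
                              warp_spatial_condition, clip_len=16):
    """Generate a long, pose-controlled trajectory clip by clip, reusing earlier results as conditions."""
    clips = [first_clip]
    for start in range(clip_len, len(camera_poses), clip_len):
        target_poses = camera_poses[start:start + clip_len]
        # Temporal condition: the last frame of the previous clip overlaps the new clip.
        overlap_frame = clips[-1][-1]
        # Spatial condition: warp spatially adjacent, previously generated imagery
        # into each target pose (e.g. via the 3D warping sketched earlier).
        spatial_cond = [warp_spatial_condition(clips, pose) for pose in target_poses]
        new_clip = video_diffusion.sample(
            temporal_cond=overlap_frame,
            spatial_cond=spatial_cond,
            poses=target_poses,
        )
        clips.append(new_clip)
    # Concatenate all clips into one long sequence.
    return [frame for clip in clips for frame in clip]
```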
Problem

Research questions and friction points this paper is trying to address.

Large-scale models
Long-range scene generation
Computational limitations
Innovation

Methods, ideas, or system contributions that make the work stand out.

StarGen
Long-range coherent scene generation
Precise camera pose control
🔎 Similar Papers
No similar papers found.
Shangjin Zhai
SenseTime Research
Zhichao Ye
Unknown affiliation
Jialin Liu
SenseTime Research
Weijian Xie
Zhejiang University
Jiaqi Hu
Rice University; Genentech
Artificial Intelligence, Deep Learning
Zhen Peng
SenseTime Research
Hua Xue
SenseTime Research
Danpeng Chen
Zhejiang University & SenseTime Research and Tetras.AI
Computer Vision, Deep Learning, SLAM
Xiaomeng Wang
SenseTime Research
Lei Yang
SenseTime Research
Nan Wang
SenseTime Research
Haomin Liu
SenseTime
SLAM, Structure from Motion
Guofeng Zhang
State Key Lab of CAD&CG, Zhejiang University