Spatia: Video Generation with Updatable Spatial Memory

📅 2025-12-17
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing video generation models struggle to maintain spatial and temporal consistency over extended durations. To address this, we propose a long-video generation framework built on an updatable 3D point cloud memory. Visual SLAM maintains the scene geometry and enables online memory updates; a dynamic-static decoupling design separates the persistent 3D geometric memory from the generation of dynamic content; and diffusion-based modeling with geometry-guided conditional control supports explicit camera trajectory specification and real-time, 3D-aware interactive editing. Experiments demonstrate substantial improvements in geometric consistency and structural stability for long videos, enabling high-fidelity, spatiotemporally coherent generation across diverse scenes. Notably, our method achieves, for the first time, precise 3D interactive editing *during* the generation process, marking a significant advance in controllable, geometry-aware video synthesis.

📝 Abstract
Existing video generation models struggle to maintain long-term spatial and temporal consistency due to the dense, high-dimensional nature of video signals. To overcome this limitation, we propose Spatia, a spatial memory-aware video generation framework that explicitly preserves a 3D scene point cloud as persistent spatial memory. Spatia iteratively generates video clips conditioned on this spatial memory and continuously updates it through visual SLAM. This dynamic-static disentanglement design enhances spatial consistency throughout the generation process while preserving the model's ability to produce realistic dynamic entities. Furthermore, Spatia enables applications such as explicit camera control and 3D-aware interactive editing, providing a geometrically grounded framework for scalable, memory-driven video generation.
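The generate-then-update loop described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: `SpatialMemory`, `generate_clip`, `estimate_points`, and `generate_long_video` are hypothetical names, the diffusion model and visual SLAM are replaced by trivial stand-ins, and the voxel-based de-duplication is an assumed way of keeping the point cloud compact.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]


@dataclass
class SpatialMemory:
    """Persistent static-scene memory: at most one 3D point per voxel (assumed design)."""
    voxel_size: float = 0.5
    voxels: Dict[Tuple[int, int, int], Point] = field(default_factory=dict)

    def update(self, new_points: List[Point]) -> int:
        """Merge new points into memory and return how many were actually added."""
        added = 0
        for p in new_points:
            key = tuple(math.floor(c / self.voxel_size) for c in p)
            if key not in self.voxels:
                self.voxels[key] = p
                added += 1
        return added


def generate_clip(memory: SpatialMemory, camera_z: List[float]) -> List[dict]:
    """Stand-in for the video generator: one 'frame' per camera position.

    Each frame only records what it was conditioned on; no pixels are produced.
    """
    return [{"camera_z": z, "num_memory_points": len(memory.voxels)} for z in camera_z]


def estimate_points(clip: List[dict]) -> List[Point]:
    """Stand-in for visual SLAM: pretend each frame reveals one static point
    two units in front of the camera."""
    return [(0.0, 0.0, frame["camera_z"] + 2.0) for frame in clip]


def generate_long_video(num_clips: int, frames_per_clip: int):
    """Iteratively generate clips conditioned on memory, then update the memory."""
    memory = SpatialMemory()
    video: List[dict] = []
    for clip_idx in range(num_clips):
        start = clip_idx * frames_per_clip
        camera_z = [float(start + i) for i in range(frames_per_clip)]
        clip = generate_clip(memory, camera_z)   # condition on the current memory
        memory.update(estimate_points(clip))     # fold newly seen geometry back in
        video.extend(clip)
    return video, memory
```

Calling `generate_long_video(3, 4)` yields 12 frames; the first clip is conditioned on an empty memory, and each later clip sees the points accumulated from all earlier clips. Revisiting a region adds nothing new because `update` skips voxels that are already occupied.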

Problem

Research questions and friction points this paper is trying to address.

How to maintain long-term spatial and temporal consistency in video generation
How to preserve a 3D scene point cloud as persistent spatial memory while still producing realistic dynamic entities
How to support explicit camera control and 3D-aware interactive editing

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses a persistent 3D point cloud as spatial memory
Updates the memory iteratively via visual SLAM
Enables explicit camera control and 3D-aware interactive editing
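
One way to picture how a point cloud memory could support explicit camera control is to project the stored points into the image plane of a user-specified camera and use the result as a conditioning signal. The sketch below assumes a simple pinhole camera that looks down the +z axis with no rotation; `project_points` and its parameters are hypothetical, and the source does not state that the conditioning is computed this way.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def project_points(
    points: List[Point],
    camera_position: Point,
    focal_length: float,
    width: int,
    height: int,
) -> List[Tuple[int, int]]:
    """Project world-space points to pixel coordinates with a pinhole model.

    Assumes the camera is axis-aligned and looks along +z; points behind the
    camera or outside the image bounds are dropped.
    """
    cx, cy = width / 2.0, height / 2.0  # principal point at the image centre
    pixels = []
    for x, y, z in points:
        # Express the point relative to the camera centre.
        xc = x - camera_position[0]
        yc = y - camera_position[1]
        zc = z - camera_position[2]
        if zc <= 0:
            continue  # behind the camera
        u = focal_length * xc / zc + cx
        v = focal_length * yc / zc + cy
        if 0 <= u < width and 0 <= v < height:
            pixels.append((int(u), int(v)))
    return pixels


memory_points = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0), (0.0, 0.0, -1.0), (100.0, 0.0, 1.0)]
view_a = project_points(memory_points, (0.0, 0.0, 0.0), 100.0, 200, 200)
view_b = project_points(memory_points, (1.0, 0.0, 0.0), 100.0, 200, 200)
```

Moving the camera from the origin to `(1, 0, 0)` shifts the same two visible points 50 pixels to the left, which is the kind of geometrically consistent change a memory-conditioned generator would be expected to follow.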