๐ค AI Summary
Existing video generation models struggle to maintain spatial and temporal consistency over extended durations. To address this, we propose a long-video generation framework built upon an updateable 3D point cloud memory space: visual SLAM dynamically maintains scene geometry, enabling online memory updates; a novel dynamic-static decoupling design separates persistent 3D geometric memory from dynamic content generation; and diffusion-based modeling, augmented with geometry-guided conditional control, supports explicit camera trajectory specification and real-time, 3D-aware interactive editing. Experiments demonstrate substantial improvements in geometric consistency and structural stability for long videos, enabling high-fidelity, spatiotemporally coherent generation across diverse scenes. Notably, our method achieves, for the first time, precise 3D interactive editing capabilities *during* the generation processโmarking a significant advance in controllable, geometry-aware video synthesis.
๐ Abstract
Existing video generation models struggle to maintain long-term spatial and temporal consistency due to the dense, high-dimensional nature of video signals. To overcome this limitation, we propose Spatia, a spatial memory-aware video generation framework that explicitly preserves a 3D scene point cloud as persistent spatial memory. Spatia iteratively generates video clips conditioned on this spatial memory and continuously updates it through visual SLAM. This dynamic-static disentanglement design enhances spatial consistency throughout the generation process while preserving the model's ability to produce realistic dynamic entities. Furthermore, Spatia enables applications such as explicit camera control and 3D-aware interactive editing, providing a geometrically grounded framework for scalable, memory-driven video generation.