Spatia: Video Generation with Updatable Spatial Memory

📅 2025-12-17

📈 Citations: 0

✨ Influential: 0

career value

197K/year

🤖 AI Summary

Existing video generation models struggle to maintain spatial and temporal consistency over extended durations. To address this, we propose a long-video generation framework built upon an updateable 3D point cloud memory space: visual SLAM dynamically maintains scene geometry, enabling online memory updates; a novel dynamic-static decoupling design separates persistent 3D geometric memory from dynamic content generation; and diffusion-based modeling, augmented with geometry-guided conditional control, supports explicit camera trajectory specification and real-time, 3D-aware interactive editing. Experiments demonstrate substantial improvements in geometric consistency and structural stability for long videos, enabling high-fidelity, spatiotemporally coherent generation across diverse scenes. Notably, our method achieves, for the first time, precise 3D interactive editing capabilities *during* the generation process—marking a significant advance in controllable, geometry-aware video synthesis.

Technology Category

Application Category

📝 Abstract

Existing video generation models struggle to maintain long-term spatial and temporal consistency due to the dense, high-dimensional nature of video signals. To overcome this limitation, we propose Spatia, a spatial memory-aware video generation framework that explicitly preserves a 3D scene point cloud as persistent spatial memory. Spatia iteratively generates video clips conditioned on this spatial memory and continuously updates it through visual SLAM. This dynamic-static disentanglement design enhances spatial consistency throughout the generation process while preserving the model's ability to produce realistic dynamic entities. Furthermore, Spatia enables applications such as explicit camera control and 3D-aware interactive editing, providing a geometrically grounded framework for scalable, memory-driven video generation.

Problem

Research questions and friction points this paper is trying to address.

Maintains long-term spatial and temporal consistency in video generation

Preserves a 3D scene point cloud as persistent spatial memory

Enables explicit camera control and 3D-aware interactive editing

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses persistent 3D point cloud as spatial memory

Updates memory iteratively via visual SLAM

Enables explicit camera control and 3D editing

🔎 Similar Papers

MotionAura: Generating High-Quality and Motion Consistent Videos using Discrete Diffusion