Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation

📅 2026-04-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two problems in long-horizon video generation: spatial inconsistency when a scene is revisited, and degraded generation quality. Both often arise from the tight coupling between memory modeling and the generative process. To resolve this, the authors propose a decoupled memory-control framework that separates memory modeling from video generation. A lightweight memory branch learns spatial consistency from historical observations and injects relevant memory on demand during generation. Key innovations include an on-demand memory mechanism, a camera-aware gating strategy, and a decoupled memory-generation architecture, enhanced by hybrid memory representations and per-frame cross-attention. The authors report state-of-the-art visual fidelity and spatial coherence together with significantly reduced training costs and data requirements.

📝 Abstract

Spatially consistent long-horizon video generation aims to maintain temporal and spatial consistency along predefined camera trajectories. Existing methods mostly entangle memory modeling with video generation, leading to inconsistent content during scene revisits and diminished generative capacity when exploring novel regions, even when trained on extensive annotated data. To address these limitations, we propose a decoupled framework that separates memory conditioning from generation. Our approach significantly reduces training costs while simultaneously enhancing spatial consistency and preserving the generative capacity for novel scene exploration. Specifically, we employ a lightweight, independent memory branch to learn precise spatial consistency from historical observations. We first introduce a hybrid memory representation to capture complementary temporal and spatial cues from generated frames, then leverage a per-frame cross-attention mechanism so that each frame is conditioned exclusively on the most spatially relevant historical information, which is injected into the generative model to ensure spatial consistency. When generating new scenes, a camera-aware gating mechanism mediates the interaction between the memory and generation modules, enabling memory conditioning only when meaningful historical references exist. Compared with existing methods, our method is highly data-efficient, yet experiments demonstrate that it achieves state-of-the-art performance in terms of both visual quality and spatial consistency.
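To make the per-frame cross-attention and camera-aware gating ideas concrete, here is a minimal Python sketch. It is not the authors' implementation. The abstract does not specify how the gate is computed or how memory is injected, so everything below is an assumption made for illustration: the function names (`cross_attend`, `camera_gate`, `inject_memory`), the hard 0/1 gate based on cosine similarity between viewing directions, the `threshold` value, and the residual-style injection.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, memory_keys, memory_values):
    """Scaled dot-product attention of one frame query over memory entries."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, key) * scale for key in memory_keys])
    dim = len(memory_values[0])
    return [sum(w * value[i] for w, value in zip(weights, memory_values))
            for i in range(dim)]


def camera_gate(current_dir, past_dirs, threshold=0.5):
    """Hypothetical hard gate (an assumption, not the paper's formulation).

    Returns 1.0 when some past viewing direction has cosine similarity of at
    least `threshold` with the current one (a likely revisit), else 0.0.
    """
    if not past_dirs:
        return 0.0

    def cosine(a, b):
        return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

    best = max(cosine(current_dir, past) for past in past_dirs)
    return 1.0 if best >= threshold else 0.0


def inject_memory(frame_feature, query, memory_keys, memory_values, gate):
    """Assumed residual injection: feature + gate * attended memory."""
    if gate == 0.0 or not memory_keys:
        # Novel region: leave the generator's feature untouched.
        return list(frame_feature)
    attended = cross_attend(query, memory_keys, memory_values)
    return [f + gate * m for f, m in zip(frame_feature, attended)]
```

In this sketch, each frame calls `camera_gate` with its own camera direction and the directions of previously generated frames, then passes the result to `inject_memory`. When the camera faces a region with no nearby historical view, the gate is 0 and the frame feature is returned unchanged, which mirrors the abstract's goal of applying memory conditioning only when meaningful historical references exist.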

Problem

Research questions and friction points this paper is trying to address.

spatial consistency

long-horizon video generation

memory modeling

scene revisits

novel scene exploration

Innovation

Methods, ideas, or system contributions that make the work stand out.

decoupled memory

spatial consistency

long-horizon video generation