🤖 AI Summary
This work addresses the challenge of maintaining spatial consistency in video world models under camera motion, scene revisitation, and interventions. Existing approaches fall short on one side or the other: explicit memory schemes model dynamic objects poorly, while implicit methods produce inaccurate camera motion even when correct poses are supplied. To overcome these limitations, we propose MosaicMem, a hybrid spatial memory mechanism that combines the geometric fidelity of explicit 3D structure with the generative flexibility of implicit representations. Our approach lifts image patches into 3D for robust localization, incorporates PRoPE-based camera pose conditioning, and introduces two novel memory alignment strategies. A patch-and-compose interface renders spatially aligned image patches into the queried view, enabling accurate pose adherence and improved dynamic modeling while preserving prompt-following capabilities. The method supports long-horizon navigation (up to minutes), memory-driven scene editing, and autoregressive forecasting.
📝 Abstract
Video diffusion models are moving beyond short, plausible clips toward world simulators that must remain consistent under camera motion, revisits, and interventions. Yet spatial memory remains a key bottleneck: explicit 3D structures can improve reprojection-based consistency but struggle to depict moving objects, while implicit memory often produces inaccurate camera motion even with correct poses. We propose Mosaic Memory (MosaicMem), a hybrid spatial memory that lifts patches into 3D for reliable localization and targeted retrieval, while exploiting the model's native conditioning to preserve prompt-following generation. MosaicMem composes spatially aligned patches in the queried view via a patch-and-compose interface, preserving what should persist while allowing the model to inpaint what should evolve. Equipped with PRoPE camera conditioning and two new memory alignment methods, MosaicMem achieves better pose adherence than implicit memory and stronger dynamic modeling than explicit baselines in our experiments. It further enables minute-level navigation, memory-based scene editing, and autoregressive rollout.
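The patch-and-compose idea can be illustrated with a minimal reprojection sketch. This is a simplified, hypothetical rendition (all function names are ours, not the paper's): memory content with known 3D positions is projected into the queried camera, scattered onto a partial canvas, and a mask marks the remaining pixels that the generative model would be left to inpaint. The real method operates on patches and a learned memory, not raw per-pixel points.

```python
import numpy as np

def project_points(points_w, extrinsic, K):
    """Project Nx3 world points into pixel coordinates.

    extrinsic: 4x4 world-to-camera transform; K: 3x3 intrinsics.
    Returns pixel coords and a mask of points in front of the camera.
    """
    pts_h = np.concatenate([points_w, np.ones((len(points_w), 1))], axis=1)
    pts_c = (extrinsic @ pts_h.T).T[:, :3]        # camera-frame coordinates
    in_front = pts_c[:, 2] > 1e-6                 # positive depth only
    uv = (K @ pts_c.T).T
    uv = uv[:, :2] / uv[:, 2:3]                   # perspective divide
    return uv, in_front

def compose_canvas(points_w, colors, extrinsic, K, hw):
    """Scatter memory points into the queried view.

    Returns a partially filled RGB canvas plus an inpaint mask marking
    pixels with no memory support (what the model should generate).
    """
    h, w = hw
    canvas = np.zeros((h, w, 3), dtype=np.float32)
    known = np.zeros((h, w), dtype=bool)
    uv, in_front = project_points(points_w, extrinsic, K)
    for (u, v), ok, c in zip(uv, in_front, colors):
        x, y = int(round(u)), int(round(v))
        if ok and 0 <= x < w and 0 <= y < h:
            canvas[y, x] = c
            known[y, x] = True
    return canvas, ~known
```

In this toy setup, a memory point landing at pixel (2, 2) fills that pixel on a 5x5 canvas; every other pixel stays in the inpaint mask, which is where a diffusion model conditioned on the canvas would synthesize new content.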