MosaicMem: Hybrid Spatial Memory for Controllable Video World Models

📅 2026-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of maintaining spatial consistency in video world models under camera motion, scene revisitation, and intervention. Existing approaches fall short in one of two ways: explicit memory schemes struggle to model dynamic objects, while implicit memory produces inaccurate camera motion even when conditioned on correct poses. To overcome these limitations, we propose MosaicMem, a hybrid spatial memory mechanism that combines the geometric fidelity of explicit 3D structure with the generative flexibility of implicit representations. The approach lifts image patches into 3D for robust localization and targeted retrieval, adds PRoPE-based camera pose conditioning, and introduces two novel memory alignment strategies. A patch-and-compose interface synthesizes spatially aligned patches in the queried view, yielding accurate pose adherence and stronger dynamic modeling while preserving prompt-following capabilities. The method supports minute-scale navigation, memory-driven scene editing, and autoregressive forecasting.
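The "3D patch lifting for robust localization" step described above can be sketched with standard pinhole-camera geometry: unproject pixels of a source frame into world space using depth and camera pose, then reproject them into the queried view. This is a minimal illustrative sketch, not the paper's implementation; all function names and conventions here are assumptions.

```python
import numpy as np

def lift_to_3d(depth, K, cam_to_world):
    """Lift every pixel of a depth map into world-space 3D points.
    depth: (H, W); K: (3, 3) pinhole intrinsics; cam_to_world: (4, 4) pose."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)  # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T               # camera-space ray directions
    pts_cam = rays * depth.reshape(-1, 1)         # scale rays by depth
    pts_h = np.concatenate([pts_cam, np.ones((pts_cam.shape[0], 1))], axis=1)
    return (pts_h @ cam_to_world.T)[:, :3]        # world-space points, (H*W, 3)

def reproject(points_w, K, world_to_cam, H, W):
    """Project world points into a queried view; returns pixel coords,
    depths, and a validity mask (in front of the camera, inside the frame)."""
    pts_h = np.concatenate([points_w, np.ones((points_w.shape[0], 1))], axis=1)
    pts_cam = (pts_h @ world_to_cam.T)[:, :3]
    z = pts_cam[:, 2]
    uvw = pts_cam @ K.T
    uv = uvw[:, :2] / np.clip(uvw[:, 2:3], 1e-6, None)  # perspective divide
    valid = (z > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < W) \
                    & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    return uv, z, valid
```

With the source and query poses equal, lifting followed by reprojection is an identity on pixel coordinates, which is a convenient sanity check for the conventions chosen here.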

📝 Abstract
Video diffusion models are moving beyond short, plausible clips toward world simulators that must remain consistent under camera motion, revisits, and intervention. Yet spatial memory remains a key bottleneck: explicit 3D structures can improve reprojection-based consistency but struggle to depict moving objects, while implicit memory often produces inaccurate camera motion even with correct poses. We propose Mosaic Memory (MosaicMem), a hybrid spatial memory that lifts patches into 3D for reliable localization and targeted retrieval, while exploiting the model's native conditioning to preserve prompt-following generation. MosaicMem composes spatially aligned patches in the queried view via a patch-and-compose interface, preserving what should persist while allowing the model to inpaint what should evolve. With PRoPE camera conditioning and two new memory alignment methods, experiments show improved pose adherence compared to implicit memory and stronger dynamic modeling than explicit baselines. MosaicMem further enables minute-level navigation, memory-based scene editing, and autoregressive rollout.
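The patch-and-compose idea in the abstract, composing retrieved memory content in the queried view while "allowing the model to inpaint what should evolve", can be illustrated as a z-buffered scatter: reprojected memory pixels fill the canvas where geometry is visible, and the remaining hole mask marks regions left to the generator. A minimal sketch under assumed conventions (names are illustrative, not from the paper):

```python
import numpy as np

def patch_and_compose(colors, uv, z, valid, H, W):
    """Composite reprojected memory pixels into the queried view.
    colors: (N, 3) source colors; uv: (N, 2) target pixel coords;
    z: (N,) depths in the target camera; valid: (N,) visibility mask.
    Returns the composed canvas and a hole mask for the generator to inpaint."""
    canvas = np.zeros((H, W, 3))
    zbuf = np.full((H, W), np.inf)
    filled = np.zeros((H, W), dtype=bool)
    ui = np.round(uv[:, 0]).astype(int)
    vi = np.round(uv[:, 1]).astype(int)
    for i in np.flatnonzero(valid):
        x, y = ui[i], vi[i]
        if 0 <= x < W and 0 <= y < H and z[i] < zbuf[y, x]:
            zbuf[y, x] = z[i]          # nearest surface wins (z-buffer test)
            canvas[y, x] = colors[i]
            filled[y, x] = True
    holes = ~filled                     # "what should evolve": left for inpainting
    return canvas, holes
```

The z-buffer resolves occlusions when multiple memory points land on the same pixel; everything not covered by memory stays in the hole mask, which is where the diffusion model's generative flexibility takes over.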
Problem

Research questions and friction points this paper is trying to address.

spatial memory
video world models
camera motion consistency
moving objects
3D scene representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

hybrid spatial memory
video world models
3D patch lifting
pose-consistent generation
memory-based scene editing
👥 Authors
Wei Yu
University of Toronto
Communication Theory · Information Theory · Wireless Communications · DSL
Runjia Qian
The University of Osaka
Yumeng Li
Liquan Wang
Georgia Institute of Technology
Songheng Yin
Mujin Inc.
Sri Siddarth Chakaravarthy P
Georgia Institute of Technology
Dennis Anthony
Georgia Institute of Technology
Yang Ye
University of Texas at Austin
Yidi Li
The University of Osaka
Weiwei Wan
The University of Osaka
Animesh Garg
Georgia Institute of Technology, University of Toronto
Robotic Manipulation · Robot Learning · Reinforcement Learning · Machine Learning · Computer Vision