WorldPack: Compressed Memory Improves Spatial Consistency in Video World Modeling

📅 2025-12-02
🤖 AI Summary
Video world models face two obstacles in long-horizon generation: weak spatiotemporal consistency and the high computational cost of long contexts. WorldPack addresses both with a compressed memory that combines trajectory packing with retrieval-based memory, so that long-range spatiotemporal dependencies can be modeled within a limited context window. On LoopNav, a Minecraft benchmark for long-term consistency, it outperforms strong state-of-the-art models in spatial consistency, fidelity, and generation quality while using a much shorter context.

📝 Abstract
Video world models have attracted significant attention for their ability to generate high-fidelity future visual observations conditioned on past observations and navigation actions. Long-term world modeling that stays temporally and spatially consistent remains an open problem, even for recent state-of-the-art models, because long-context inputs are prohibitively expensive to process. In this paper, we propose WorldPack, a video world model with an efficient compressed memory that significantly improves spatial consistency, fidelity, and quality in long-term generation despite a much shorter context length. The compressed memory consists of trajectory packing and memory retrieval: trajectory packing achieves high context efficiency, while memory retrieval maintains consistency across rollouts and supports long-term generation that requires spatial reasoning. We evaluate WorldPack on LoopNav, a Minecraft benchmark designed to assess long-term consistency, and verify that it notably outperforms strong state-of-the-art models.
Problem

Research questions and friction points this paper is trying to address.

Long-term video world models lose spatial consistency over extended rollouts
Long-context inputs are prohibitively expensive to process
Fidelity and quality degrade in long-term visual generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compressed memory improves spatial consistency
Trajectory packing enables high context efficiency
Memory retrieval maintains consistency in rollouts
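
The page does not describe how trajectory packing or memory retrieval are implemented, so the following is only a minimal sketch of the two ideas under stated assumptions, not WorldPack's actual method. It assumes that packing keeps recent frames at full token resolution and halves the token count for each older frame until a context budget is used up, and that retrieval picks older frames whose agent pose is closest to the current pose. The `Frame` fields, the token counts, the halving schedule, and the pose score are all hypothetical.

```python
from dataclasses import dataclass
import math


@dataclass
class Frame:
    step: int          # time index in the rollout
    position: tuple    # (x, z) agent position; hypothetical pose format
    yaw_deg: float     # heading in degrees; hypothetical


def pack_trajectory(history, budget, full_tokens=256, min_tokens=4):
    """Assign a token count to each frame, newest first.

    Assumed schedule: the newest frame gets `full_tokens`, and each older
    frame gets half as many (never fewer than `min_tokens`). Packing stops
    when the next frame would exceed `budget`.
    Returns (list of (step, tokens) in chronological order, tokens used).
    """
    packed, used, tokens = [], 0, full_tokens
    for frame in reversed(history):
        if used + tokens > budget:
            break
        packed.append((frame.step, tokens))
        used += tokens
        tokens = max(min_tokens, tokens // 2)
    return packed[::-1], used


def retrieve_memory(history, current, exclude_steps, k=2):
    """Return the k frames outside the packed window closest in pose.

    Assumed score: Euclidean distance between positions plus the wrapped
    yaw difference scaled by 90 degrees. Lower is more similar.
    """
    def score(frame):
        dist = math.dist(frame.position, current.position)
        dyaw = abs((frame.yaw_deg - current.yaw_deg + 180) % 360 - 180)
        return dist + dyaw / 90.0

    candidates = [f for f in history if f.step not in exclude_steps]
    return sorted(candidates, key=score)[:k]
```

In this sketch the model's context would hold the packed recent frames plus the few retrieved frames. When the agent loops back to a previously visited place, the retrieved frames bring back views of that place even though they fell out of the recent window long ago.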