Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models

📅 2026-03-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a critical limitation of existing video world models: they often fail to maintain motion continuity when dynamic subjects temporarily leave the field of view, which shows up as freezing, distortion, or disappearance. To tackle this challenge, we propose HyDRA, an architecture with a hybrid memory mechanism that decouples static and dynamic scene components. By combining token-based memory compression with a spatiotemporal relevance-driven retrieval strategy, HyDRA preserves the identity and motion of subjects while they are out of view. To support research in this direction, we introduce HM-World, the first large-scale, high-fidelity video dataset (59K clips) designed specifically for hybrid memory modeling, along with a dedicated benchmark for evaluating subject consistency across exit and entry events. Experiments show that the approach significantly outperforms state-of-the-art methods in both dynamic subject consistency and overall video generation quality.

📝 Abstract
Video world models have shown immense potential in simulating the physical world, yet existing memory mechanisms primarily treat environments as static canvases. When dynamic subjects hide out of sight and later re-emerge, current methods often struggle, leading to frozen, distorted, or vanishing subjects. To address this, we introduce Hybrid Memory, a novel paradigm requiring models to simultaneously act as precise archivists for static backgrounds and vigilant trackers for dynamic subjects, ensuring motion continuity during out-of-view intervals. To facilitate research in this direction, we construct HM-World, the first large-scale video dataset dedicated to hybrid memory. It features 59K high-fidelity clips with decoupled camera and subject trajectories, encompassing 17 diverse scenes, 49 distinct subjects, and meticulously designed exit-entry events to rigorously evaluate hybrid coherence. Furthermore, we propose HyDRA, a specialized memory architecture that compresses memory into tokens and utilizes a spatiotemporal relevance-driven retrieval mechanism. By selectively attending to relevant motion cues, HyDRA effectively preserves the identity and motion of hidden subjects. Extensive experiments on HM-World demonstrate that our method significantly outperforms state-of-the-art approaches in both dynamic subject consistency and overall generation quality.
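
The abstract describes HyDRA only at a high level: memory is compressed into tokens, and tokens are retrieved by spatiotemporal relevance. The following is a minimal illustrative sketch of that general idea, not the authors' implementation. Every name here (`compress_to_token`, `retrieve`, the mean-pooling compression, the cosine-times-temporal-decay score, and the `decay` parameter) is a hypothetical choice made for illustration.

```python
import math

def compress_to_token(patch_features):
    """Mean-pool one frame's patch feature vectors into a single memory token.
    (Assumed compression scheme; the paper's actual scheme is not given here.)"""
    n = len(patch_features)
    dim = len(patch_features[0])
    return [sum(p[d] for p in patch_features) / n for d in range(dim)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def retrieve(memory, query, query_time, k=2, decay=0.1):
    """Return timestamps of the k most relevant memory tokens.

    memory: list of (timestamp, token) pairs.
    Relevance (assumed) = feature similarity * exponential temporal decay.
    """
    scored = []
    for t, token in memory:
        score = cosine(token, query) * math.exp(-decay * abs(query_time - t))
        scored.append((score, t))
    scored.sort(reverse=True)
    return [t for _, t in scored[:k]]

# Three past frames, each with two 2-D patch features (toy data).
memory = [
    (0, compress_to_token([[1.0, 0.0], [1.0, 0.0]])),
    (1, compress_to_token([[0.0, 1.0], [0.0, 1.0]])),
    (2, compress_to_token([[1.0, 0.2], [0.8, 0.0]])),
]
retrieved = retrieve(memory, query=[1.0, 0.0], query_time=3, k=2)
print(retrieved)
```

In this toy setup the frame at time 1 is never retrieved because its token is orthogonal to the query, while the two frames that resemble the query are ranked with a preference for the more recent one.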

Problem

Research questions and friction points this paper is trying to address.

video world models
dynamic subjects
out-of-view continuity
memory mechanisms
motion coherence

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid Memory
Dynamic Video World Models
HM-World
HyDRA
Out-of-View Tracking