RELIC: Interactive Video World Model with Long-Horizon Memory

📅 2025-12-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing interactive world models struggle to simultaneously achieve real-time long-horizon streaming generation, spatial memory consistency, and precise user control—particularly as long-term memory mechanisms often compromise real-time performance. This paper introduces the first unified framework that jointly enables real-time long-horizon memory, spatially consistent scene modeling, and fine-grained user control for coherent, extended-scene exploration from a single image and text prompt. Key innovations include: (1) a camera-aware memory architecture integrating relative actions with absolute pose representations; and (2) a self-enforced distillation paradigm leveraging an autoregressive video diffusion model, where compressed historical latent tokens and KV caching enable efficient bidirectional teacher–student distillation. Evaluated at 14B parameters, our method achieves 16 FPS real-time generation and outperforms state-of-the-art methods across action-following accuracy, long-horizon stability, and spatial memory retrieval performance.

📝 Abstract
A truly interactive world model requires three key ingredients: real-time long-horizon streaming, consistent spatial memory, and precise user control. However, most existing approaches address only one of these aspects in isolation, as achieving all three simultaneously is highly challenging; for example, long-term memory mechanisms often degrade real-time performance. In this work, we present RELIC, a unified framework that tackles these three challenges altogether. Given a single image and a text description, RELIC enables memory-aware, long-duration exploration of arbitrary scenes in real time. Built upon recent autoregressive video-diffusion distillation techniques, our model represents long-horizon memory using highly compressed historical latent tokens encoded with both relative actions and absolute camera poses within the KV cache. This compact, camera-aware memory structure supports implicit 3D-consistent content retrieval and enforces long-term coherence with minimal computational overhead. In parallel, we fine-tune a bidirectional teacher video model to generate sequences beyond its original 5-second training horizon, and transform it into a causal student generator using a new memory-efficient self-forcing paradigm that enables full-context distillation over long-duration teacher rollouts as well as long student self-rollouts. Implemented as a 14B-parameter model and trained on a curated Unreal Engine-rendered dataset, RELIC achieves real-time generation at 16 FPS while demonstrating more accurate action following, more stable long-horizon streaming, and more robust spatial-memory retrieval compared with prior work. These capabilities establish RELIC as a strong foundation for the next generation of interactive world modeling.
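The camera-aware memory described in the abstract — compressed historical latents fused with absolute camera poses and queried via attention inside a KV cache — can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the paper's implementation: `CameraAwareMemory`, `pose_proj`, and the 3x4-extrinsics pose format are all hypothetical names and choices, and the real mechanism lives inside a transformer's KV cache rather than a standalone class.

```python
import numpy as np

class CameraAwareMemory:
    """Toy sketch: compressed historical latents keyed by absolute camera pose."""

    def __init__(self, dim: int, max_entries: int = 256):
        rng = np.random.default_rng(0)
        # Hypothetical linear pose embedding for a flattened 3x4 extrinsics matrix.
        self.pose_proj = rng.standard_normal((12, dim)) / np.sqrt(12)
        self.max_entries = max_entries
        self.entries = []  # stands in for cached key/value tokens

    def write(self, latent: np.ndarray, pose: np.ndarray) -> None:
        # Fuse the compressed frame latent with its absolute pose before caching.
        entry = latent + pose.reshape(-1) @ self.pose_proj
        self.entries.append(entry)
        if len(self.entries) > self.max_entries:  # bound the memory footprint
            self.entries.pop(0)

    def read(self, query: np.ndarray) -> np.ndarray:
        # Attention-style retrieval: pose-aware keys let a revisited viewpoint
        # fetch spatially consistent content from history.
        K = np.stack(self.entries)                      # (N, dim)
        logits = K @ query / np.sqrt(query.size)
        attn = np.exp(logits - logits.max())
        attn /= attn.sum()
        return attn @ K                                 # (dim,)
```

The fixed-size, evict-oldest cache mirrors the paper's point that memory must stay compact for real-time generation; the pose term in each entry is what makes retrieval camera-aware.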
Problem

Research questions and friction points this paper is trying to address.

Enables real-time, long-duration interactive scene exploration
Integrates long-horizon memory with consistent spatial awareness
Achieves precise user control and real-time performance simultaneously
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses compressed latent tokens with camera poses for memory
Fine-tunes teacher model for long-horizon video generation
Implements memory-efficient self-forcing for real-time distillation
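The self-forcing idea in the bullets above — a causal student rolling out on its own outputs while a frozen bidirectional teacher supervises the full generated sequence — can be illustrated with a toy loss computation. `student_step` and `teacher` here are placeholder callables, not the paper's 14B models, and the MSE objective is an assumption standing in for the actual distillation loss.

```python
import numpy as np

def self_forcing_distill_loss(student_step, teacher, init_latent, actions):
    """Toy sketch of one self-forcing distillation objective.

    The student generates causally, frame by frame, conditioned on its own
    previous output (self-rollout); the frozen teacher then scores the entire
    sequence with full bidirectional context.
    """
    rollout = [init_latent]
    for a in actions:                      # causal student self-rollout
        rollout.append(student_step(rollout[-1], a))
    seq = np.stack(rollout)                # (T+1, dim) full generated sequence
    target = teacher(seq)                  # full-context teacher pass (frozen)
    return float(np.mean((seq - target) ** 2))
```

Distilling against the student's own rollouts, rather than teacher-forced prefixes, is what lets the causal generator stay stable over horizons far beyond the teacher's original training length.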