🤖 AI Summary
Autoregressive video diffusion models face significant memory and computational bottlenecks during long-horizon generation because the key-value (KV) cache grows linearly with rollout length, making it difficult to achieve real-time performance and scene consistency at the same time. This work proposes WorldKV, a framework that enables efficient long-term memory without requiring fine-tuning. WorldKV retrieves historical KV blocks using camera or action semantics and compresses redundant tokens by leveraging key-key similarity across critical frames. The approach avoids the conventional trade-off between sliding-window and full-cache strategies, achieving memory fidelity on par with or superior to full KV caching on benchmarks such as Matrix-Game-2.0 and LingBot-World-Fast, while delivering approximately 2× higher inference throughput.
📝 Abstract
Autoregressive video diffusion models have enabled real-time, action-conditioned world generation. However, sustaining a persistent world, where revisiting a previously seen viewpoint yields consistent content, remains an open problem. Full KV-cache attention preserves this consistency but breaks real-time constraints: memory footprint and attention cost grow linearly with rollout length. Sliding-window inference restores throughput but discards long-term consistency. We propose WorldKV, a training-free framework with two components: World Retrieval and World Compression. World Retrieval stores evicted KV-cache chunks in GPU/CPU memory and selectively retrieves scene-relevant chunks via camera/action correspondence, inserting them back into the native attention window without re-encoding. World Compression prunes redundant tokens within each chunk via key-key similarity to an anchor frame, halving per-chunk storage to fit 2x more history under a fixed budget. On Matrix-Game-2.0 and LingBot-World-Fast, WorldKV matches or exceeds full-KV memory fidelity at roughly 2x the throughput, and is competitive with memory-trained baselines without any fine-tuning. Project Page: https://cvlab-kaist.github.io/WorldKV/
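
The compression idea described above can be illustrated with a small, hedged sketch. It is not the authors' implementation: the function names `cosine` and `prune_chunk`, the use of cosine similarity, the "maximum similarity to any anchor key" redundancy score, and the `keep_ratio` parameter are all assumptions made for illustration. The sketch scores each token key in a chunk by how closely it matches a key from an anchor frame, then keeps the least redundant half.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def prune_chunk(anchor_keys, chunk_keys, keep_ratio=0.5):
    """Return sorted indices of the chunk tokens to keep.

    Assumption (hypothetical scoring rule): a token is redundant when its key
    is highly similar to some key of the anchor frame. `anchor_keys` must be
    non-empty.
    """
    # Redundancy score per token: best match against any anchor key.
    redundancy = [max(cosine(k, a) for a in anchor_keys) for k in chunk_keys]
    n_keep = max(1, math.ceil(len(chunk_keys) * keep_ratio))
    # Least redundant tokens first; keep the first n_keep of them.
    order = sorted(range(len(chunk_keys)), key=lambda i: redundancy[i])
    return sorted(order[:n_keep])


# Toy example: 2-D keys, two anchor tokens, four chunk tokens.
anchor = [[1.0, 0.0], [0.0, 1.0]]
chunk = [[1.0, 0.0], [0.9, 0.1], [1.0, 1.0], [-1.0, 0.5]]
kept = prune_chunk(anchor, chunk)
print(kept)  # [2, 3]
```

In the toy example, tokens 0 and 1 nearly duplicate the first anchor key and are dropped, so the chunk is stored at half its original size. In a real model the same index set would be applied to both the keys and the values of the chunk, per layer and per head; those details are not specified in the abstract.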