Decouple and Cache: KV Cache Construction for Streaming Video Understanding

📅 2026-05-03
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the challenge of building and maintaining KV caches for unbounded video streams, where existing methods are constrained by limited memory and computation and generalize poorly when trained only on short sequences. The authors propose DSCache, a training-free, decoupled caching mechanism that keeps the cumulative KV cache of past inputs separate from an instant cache constructed on demand for recent inputs, and adds a position-agnostic encoding to support extrapolation beyond the training length. As the first approach specifically targeting KV cache construction in streaming video understanding, DSCache is compatible with existing VideoVLLM frameworks and achieves state-of-the-art performance on Streaming Video QA benchmarks, improving average accuracy by 2.5% over prior methods.
📝 Abstract
Streaming video understanding requires processing unbounded video streams with limited memory and computation, posing two key challenges. First, unbounded streams require continuously constructing new key-value (KV) caches and evicting old ones. Second, due to the high cost of collecting and training on unbounded streams, models must learn from short sequences while generalizing to long streams. Existing streaming VideoVLLMs either fail to scale to unbounded video streams or focus on cache reuse strategies, leaving the impact of cache construction underexplored. In this paper, we propose Decoupled Streaming Cache (DSCache), a training-free cache construction mechanism that adapts pretrained offline models to streaming settings. DSCache maintains a cumulative past KV cache while constructing a separate instant cache on demand, decoupled from past caches to preserve the informativeness of recent inputs. To enable position extrapolation beyond the training length, DSCache further incorporates a position-agnostic encoding strategy, ensuring that KV caches support unseen positions and preventing position overflow. Experiments on Streaming Video QA benchmarks demonstrate DSCache's state-of-the-art performance, with an average accuracy gain of 2.5% over prior methods.
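The bookkeeping described in the abstract can be pictured with a small toy sketch. This is not the DSCache implementation: the class name `DecoupledCacheSketch`, the `max_past` budget, the oldest-first eviction rule, and the idea of storing entries without positions and numbering them only at read time are all assumptions made for illustration, and plain strings stand in for real key-value tensors.

```python
from collections import deque


class DecoupledCacheSketch:
    """Toy illustration only; all names and policies here are hypothetical."""

    def __init__(self, max_past):
        # Cumulative past cache with a fixed budget (assumed policy):
        # deque(maxlen=...) drops the oldest entry once the budget is full.
        self.past = deque(maxlen=max_past)
        # Instant cache for the newest chunk, held separately from the past.
        self.instant = []

    def build_instant(self, chunk_entries):
        # Construct the instant cache from the newest chunk alone.
        self.instant = list(chunk_entries)

    def commit(self):
        # Fold the instant cache into the cumulative past cache.
        self.past.extend(self.instant)
        self.instant = []

    def read(self):
        # Entries are stored without positions; positions are assigned
        # here from the slot index, so they never exceed
        # max_past + len(instant) - 1, however long the stream runs.
        entries = list(self.past) + self.instant
        return list(enumerate(entries))


cache = DecoupledCacheSketch(max_past=4)
for t in range(10):
    cache.build_instant([f"frame{t}"])
    view = cache.read()  # past entries followed by the instant entry
    cache.commit()
```

After ten steps the past cache holds only the four most recent entries, and the largest position in the last `view` is 4 rather than 9, which is the sense in which bounded, read-time positions avoid overflow in this sketch. How DSCache actually builds, evicts, and encodes its caches is specified in the paper, not here.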
Problem

Research questions and friction points this paper is trying to address.

streaming video understanding
KV cache construction
unbounded video streams
position extrapolation
VideoVLLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoupled Streaming Cache
KV Cache Construction
Streaming Video Understanding
Position-Agnostic Encoding
Training-Free Adaptation