🤖 AI Summary
This work addresses limitations of existing streaming video understanding methods, which lose fine-grained visual detail under constrained per-frame token budgets and accumulate retrieval errors over time. To mitigate these issues, the authors propose MemStream, a dynamic key-value (KV) cache mechanism that combines adaptive token selection with a training-free hybrid retrieval strategy leveraging external models. This approach enables more granular spatiotemporal understanding while alleviating temporal drift. With Qwen2.5-VL-7B as the backbone, MemStream outperforms ReKV by 8.0% on CG-Bench, 8.5% on LVBench, and 2.4% on VideoMME (Long), demonstrating stronger keyframe identification and long-term video comprehension.
📝 Abstract
Streaming video understanding requires models to robustly encode, store, and retrieve information from a continuous video stream to support accurate video question answering (VQA). Existing state-of-the-art approaches rely on key-value caching to accumulate frame-level information over time, but use a limited number of tokens per frame, leading to the loss of fine-grained visual details. In this work, we propose scaling the token budget to enable more granular spatiotemporal understanding and reasoning. First, we find that current methods are ill-equipped to handle dense streams: their feature encoding causes query-frame similarity scores to increase over time, biasing retrieval toward later frames. To address this, we introduce an adaptive selection strategy that reduces token redundancy while preserving local spatiotemporal information. We further propose a training-free retrieval mixture-of-experts that leverages external models to better identify relevant frames. Our method, MemStream, achieves +8.0% on CG-Bench, +8.5% on LVBench, and +2.4% on VideoMME (Long) over ReKV with Qwen2.5-VL-7B.
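The two core ideas in the abstract, redundancy-aware token selection per frame and similarity-based frame retrieval from a KV cache, can be illustrated with a minimal toy sketch. This is not the paper's implementation: the function names, the cosine-similarity threshold, and the max-pooled query-frame scoring rule are assumptions chosen for clarity.

```python
import numpy as np

def adaptive_select(frame_tokens, budget, sim_threshold=0.95):
    """Greedy redundancy pruning (illustrative stand-in for adaptive
    token selection): keep a token only if its cosine similarity to
    every already-kept token is below the threshold, up to a fixed
    per-frame budget. Returns L2-normalized kept tokens."""
    kept = []
    for tok in frame_tokens:
        t = tok / (np.linalg.norm(tok) + 1e-8)
        if all(float(t @ k) < sim_threshold for k in kept):
            kept.append(t)
        if len(kept) == budget:
            break
    return np.stack(kept)

def retrieve_top_frames(query, cache, k=2):
    """Score each cached frame by the max cosine similarity between the
    query and that frame's kept (already normalized) tokens, then return
    the top-k frame indices in ascending order."""
    q = query / (np.linalg.norm(query) + 1e-8)
    scores = [float(np.max(toks @ q)) for toks in cache]
    return sorted(np.argsort(scores)[-k:].tolist())
```

In this toy setup, duplicate or near-duplicate tokens within a frame are dropped before they enter the cache, so a larger token budget can be spent on genuinely distinct visual content; retrieval then compares the query against each frame's surviving tokens rather than a single pooled vector.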