StreamMem: Query-Agnostic KV Cache Memory for Streaming Video Understanding

📅 2025-08-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high memory and computational overhead incurred by storing long-term visual key-value (KV) caches in multimodal large language models (MLLMs) for long-video understanding, this paper proposes a query-agnostic, streaming-adaptive KV cache compression mechanism. Without requiring prior knowledge of the questions or the full video input, the method introduces generic query tokens to dynamically model the importance of visual tokens, enabling attention-driven selective KV compression while maintaining a fixed-size memory structure. The framework supports real-time streaming video encoding, multi-turn dialogue, and long-horizon temporal reasoning. Evaluated on three long-video understanding and two streaming question-answering benchmarks, it achieves state-of-the-art performance among query-agnostic methods with a significantly reduced memory footprint, and remains competitive with query-aware methods that rely on question priors.

📝 Abstract
Multimodal large language models (MLLMs) have made significant progress in visual-language reasoning, but their ability to efficiently handle long videos remains limited. Despite recent advances in long-context MLLMs, storing and attending to the key-value (KV) cache for long visual contexts incurs substantial memory and computational overhead. Existing visual compression methods require either encoding the entire visual context before compression or having access to the questions in advance, which is impractical for long video understanding and multi-turn conversational settings. In this work, we propose StreamMem, a query-agnostic KV cache memory mechanism for streaming video understanding. Specifically, StreamMem encodes new video frames in a streaming manner, compressing the KV cache using attention scores between visual tokens and generic query tokens, while maintaining a fixed-size KV memory to enable efficient question answering (QA) in memory-constrained, long-video scenarios. Evaluation on three long video understanding and two streaming video question answering benchmarks shows that StreamMem achieves state-of-the-art performance in query-agnostic KV cache compression and is competitive with query-aware compression approaches.
Problem

Research questions and friction points this paper is trying to address.

Efficiently handling long videos with MLLMs
Reducing KV cache memory and computational overhead
Enabling query-agnostic compression for streaming video QA
Innovation

Methods, ideas, or system contributions that make the work stand out.

Streaming KV cache compression using attention scores
Fixed-size memory for long-video QA efficiency
Query-agnostic compression without pre-encoding requirement
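The core mechanism described above can be sketched in a few lines: score each cached visual token by the attention it receives from a small set of generic (question-independent) query tokens, keep only the top-scoring KV pairs, and re-apply this pruning each time a new frame's KV entries arrive so the memory stays at a fixed budget. This is a minimal NumPy sketch under assumed shapes and a single attention head; the function names (`compress_kv`, `stream_update`) and the mean-over-queries importance aggregation are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def compress_kv(keys, values, query_tokens, budget):
    """Keep the `budget` most-attended KV pairs (hypothetical sketch).

    keys, values: (num_tokens, d) cached visual KV entries
    query_tokens: (num_queries, d) generic, question-independent queries
    """
    d = keys.shape[-1]
    # Scaled dot-product attention logits: (num_queries, num_tokens)
    logits = query_tokens @ keys.T / np.sqrt(d)
    # Softmax over tokens for each query
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Aggregate importance per token across query tokens (assumed: mean)
    importance = weights.mean(axis=0)
    # Keep top-`budget` tokens, preserving their temporal order
    keep = np.sort(np.argsort(importance)[-budget:])
    return keys[keep], values[keep]

def stream_update(mem_k, mem_v, new_k, new_v, query_tokens, budget):
    """Append a new frame's KV entries, then prune back to the budget."""
    k = np.concatenate([mem_k, new_k], axis=0)
    v = np.concatenate([mem_v, new_v], axis=0)
    return compress_kv(k, v, query_tokens, budget)
```

Because pruning happens after every frame, the memory never grows beyond `budget` entries regardless of video length, which is what makes question answering tractable in the streaming, memory-constrained setting.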