Streaming Video Question-Answering with In-context Video KV-Cache Retrieval

📅 2025-03-01
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing VideoQA systems incur high latency and GPU-memory overhead on long videos because they preload all frames and redundantly re-encode the video for every new question. This work proposes ReKV, a training-free framework for streaming video question-answering (StreamingVQA). Its core contributions are: (1) a training-free KV-cache streaming mechanism that decouples video encoding from question answering; (2) on-demand KV loading via sliding-window attention and internal/external retrievers, coupled with hybrid RAM-and-disk cache management; and (3) a lightweight integration architecture compatible with mainstream Video-LLMs. Experiments demonstrate that ReKV enables low-latency, real-time interaction on hour-long videos, reduces GPU memory consumption several-fold, accelerates inference by 2–5×, and maintains state-of-the-art accuracy.

๐Ÿ“ Abstract
We propose ReKV, a novel training-free approach that enables efficient streaming video question-answering (StreamingVQA), by seamlessly integrating with existing Video Large Language Models (Video-LLMs). Traditional VideoQA systems struggle with long videos, as they must process entire videos before responding to queries, and repeat this process for each new question. In contrast, our approach analyzes long videos in a streaming manner, allowing for prompt responses as soon as user queries are received. Building on a common Video-LLM, we first incorporate a sliding-window attention mechanism, ensuring that input frames attend to a limited number of preceding frames, thereby reducing computational overhead. To prevent information loss, we store processed video key-value caches (KV-Caches) in RAM and disk, reloading them into GPU memory as needed. Additionally, we introduce a retrieval method that leverages an external retriever or the parameters within Video-LLMs to retrieve only query-relevant KV-Caches, ensuring both efficiency and accuracy in question answering. ReKV enables the separation of video encoding and question-answering across different processes and GPUs, significantly enhancing the efficiency of StreamingVQA. Through comprehensive experimentation, we validate the efficacy and practicality of our approach, which significantly boosts efficiency and enhances applicability over existing VideoQA models.
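The abstract's sliding-window attention restricts each incoming frame token to attend only to a bounded number of preceding tokens. A minimal numpy sketch of that masking idea follows; the function names, shapes, and window size are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Illustrative sketch (not the paper's code): each position attends only to
# the last `window` positions, so per-token cost stays bounded as the
# video stream grows.

def sliding_window_mask(seq_len, window):
    """Boolean mask: position i may attend to j iff i - window < j <= i."""
    idx = np.arange(seq_len)
    return (idx[None, :] <= idx[:, None]) & (idx[None, :] > idx[:, None] - window)

def attention(q, k, v, window):
    """Masked softmax attention with a causal sliding window."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    mask = sliding_window_mask(q.shape[0], window)
    scores = np.where(mask, scores, -np.inf)   # block out-of-window positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q = k = v = rng.normal(size=(6, 4))  # 6 toy tokens, dim 4
out = attention(q, k, v, window=3)
print(out.shape)  # (6, 4)
```

Because the mask is causal and bounded, evicted (out-of-window) KV entries are exactly what ReKV offloads to RAM/disk rather than discarding.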
Problem

Research questions and friction points this paper is trying to address.

Traditional VideoQA systems must process an entire video before responding, and must repeat this for every new question.
Full attention over all frames of a long video incurs prohibitive computational and GPU-memory overhead.
Discarding past context to bound memory risks losing information needed to answer later queries accurately.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sliding-window attention bounds computation by letting each frame attend to only a limited number of preceding frames.
Offloading processed KV-Caches to RAM and disk prevents information loss without occupying GPU memory.
An external retriever, or the Video-LLM's own parameters, selects only query-relevant KV-Caches for reloading.
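The retrieval step above can be sketched as follows. This is a hypothetical toy version under our own assumptions: chunks of cached keys/values live in a CPU-side store (a dict standing in for RAM/disk), and a question embedding is scored against each chunk's mean key by cosine similarity so that only the top-k relevant caches are reloaded.

```python
import numpy as np

# Toy sketch of query-relevant KV-cache retrieval (names are ours, not ReKV's).

def cache_chunk(store, chunk_id, keys, values):
    """Offload one chunk's KV tensors; the dict stands in for RAM/disk storage."""
    store[chunk_id] = (keys, values)

def retrieve_relevant(store, query_emb, top_k=2):
    """Rank cached chunks by cosine similarity between the query and each chunk's mean key."""
    scores = {}
    for cid, (keys, _) in store.items():
        mean_key = keys.mean(axis=0)
        scores[cid] = float(
            np.dot(query_emb, mean_key)
            / (np.linalg.norm(query_emb) * np.linalg.norm(mean_key) + 1e-8)
        )
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

rng = np.random.default_rng(0)
store = {}
for cid in range(5):
    cache_chunk(store, cid, rng.normal(size=(8, 16)), rng.normal(size=(8, 16)))

query = store[3][0].mean(axis=0)  # construct a query aligned with chunk 3
selected = retrieve_relevant(store, query, top_k=2)
print(selected)
```

Only the selected chunks would then be moved back onto the GPU for answering, which is what keeps question-time memory and latency low.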
Shangzhe Di
Shanghai Jiao Tong University
Video Understanding · Multimodal Learning · Computer Vision
Zhelun Yu
Alibaba Group
Guanghao Zhang
Alibaba Group
Haoyuan Li
Alibaba Group
Tao Zhong
Alibaba Group
Hao Cheng
Alibaba Group
Bolin Li
Alibaba Group
Wanggui He
Researcher, Alibaba Group
Fangxun Shu
Bytedance
Multimodal
Hao Jiang
Alibaba Group