🤖 AI Summary
To address the linear growth of KV caches with video duration in real-time video understanding on edge devices, which causes out-of-memory failures, this paper proposes the first training-free, query-agnostic streaming KV compression framework. Our method introduces two synergistic mechanisms, Temporal-axis Redundancy (TaR) detection and Value-Norm (VaN) ranking for semantic preservation, enabling dynamic cache pruning under a strict, fixed memory budget. We further design a lightweight online compression pathway that supports plug-and-play integration across diverse vision-language models. Experiments demonstrate up to a 94% reduction in peak GPU memory while preserving real-time generation capability. On long-video and streaming benchmarks, our approach matches or surpasses full-cache baselines in accuracy and supports multi-turn dialogue. Crucially, it is the first to enforce a hard memory constraint that is strictly independent of video length, guaranteeing bounded memory usage regardless of input duration.
📝 Abstract
Modern multimodal large language models (MLLMs) can reason over hour-long videos, yet their key-value (KV) cache grows linearly with time, quickly exceeding the fixed memory of phones, AR glasses, and edge robots. Prior compression schemes either assume the whole video and user query are available offline or must first build the full cache, so memory still scales with stream length. InfiniPot-V is the first training-free, query-agnostic framework that enforces a hard, length-independent memory cap for streaming video understanding. During video encoding it monitors the cache and, once a user-set threshold is reached, runs a lightweight compression pass that (i) removes temporally redundant tokens via a Temporal-axis Redundancy (TaR) metric and (ii) keeps semantically significant tokens via Value-Norm (VaN) ranking. Across four open-source MLLMs, four long-video benchmarks, and two streaming-video benchmarks, InfiniPot-V cuts peak GPU memory by up to 94%, sustains real-time generation, and matches or surpasses full-cache accuracy, even in multi-turn dialogues. By dissolving the KV cache bottleneck without retraining or query knowledge, InfiniPot-V closes the gap for on-device streaming video assistants.
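The compression pass described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the exact TaR and VaN formulas are assumptions here (TaR is modeled as cosine similarity between each token's key and its temporal predecessor, VaN as the L2 norm of value vectors), and the function names, the `tar_keep_ratio` parameter, and the toy dimensions are invented for the example.

```python
import numpy as np

def compress_kv_cache(keys, values, budget, tar_keep_ratio=2.0):
    """One compression pass in the spirit of InfiniPot-V (forms assumed).

    keys, values: (T, d) arrays of cached video-token KV pairs.
    budget: hard cap on the number of tokens kept.
    """
    T, _ = keys.shape
    if T <= budget:
        return keys, values
    # TaR (assumed form): cosine similarity between each token's key and its
    # temporal predecessor; a high score marks the token as redundant.
    a, b = keys[:-1], keys[1:]
    cos = np.einsum("td,td->t", a, b) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8)
    redundancy = np.concatenate(([-1.0], cos))  # first token never redundant
    # Keep the least-redundant tokens as TaR survivors (a multiple of budget).
    n_tar = min(T, int(budget * tar_keep_ratio))
    survivors = np.argsort(redundancy)[:n_tar]
    # VaN (assumed form): rank survivors by value-vector L2 norm; larger norms
    # are treated as more semantically significant.
    vnorm = np.linalg.norm(values[survivors], axis=1)
    keep = survivors[np.argsort(-vnorm)[:budget]]
    keep.sort()  # restore temporal order of the retained tokens
    return keys[keep], values[keep]

# Streaming usage: compress whenever the cache crosses the user-set cap, so
# memory stays bounded no matter how long the stream runs.
rng = np.random.default_rng(0)
K = V = None
for _ in range(8):  # 8 incoming chunks of 64 video tokens each
    k, v = rng.normal(size=(64, 128)), rng.normal(size=(64, 128))
    K = k if K is None else np.vstack([K, k])
    V = v if V is None else np.vstack([V, v])
    if len(K) > 256:  # hard cap, independent of stream length
        K, V = compress_kv_cache(K, V, budget=256)
```

Note the cap is enforced per pass: the cache briefly holds one extra chunk, is pruned back to the budget, and never grows with total video duration.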