InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding

📅 2025-06-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the linear growth of the KV cache with video duration in real-time video understanding on edge devices, which causes out-of-memory failures, this paper proposes the first training-free, query-agnostic streaming KV cache compression framework. The method introduces two synergistic mechanisms, Temporal-axis Redundancy (TaR) detection and Value-Norm (VaN) ranking for semantic preservation, enabling dynamic cache pruning under a strict, fixed memory budget. A lightweight online compression pathway further supports plug-and-play integration across diverse vision-language models. Experiments demonstrate up to a 94% reduction in peak GPU memory while preserving real-time generation. On long-video and streaming benchmarks, the approach matches or surpasses full-cache baselines in accuracy and supports multi-turn dialogue. Crucially, it is the first to enforce a strictly video-length-independent hard memory constraint, guaranteeing bounded memory usage regardless of input duration.

📝 Abstract
Modern multimodal large language models (MLLMs) can reason over hour-long video, yet their key-value (KV) cache grows linearly with time, quickly exceeding the fixed memory of phones, AR glasses, and edge robots. Prior compression schemes either assume the whole video and user query are available offline or must first build the full cache, so memory still scales with stream length. InfiniPot-V is the first training-free, query-agnostic framework that enforces a hard, length-independent memory cap for streaming video understanding. During video encoding it monitors the cache and, once a user-set threshold is reached, runs a lightweight compression pass that (i) removes temporally redundant tokens via a Temporal-axis Redundancy (TaR) metric and (ii) keeps semantically significant tokens via Value-Norm (VaN) ranking. Across four open-source MLLMs and four long-video and two streaming-video benchmarks, InfiniPot-V cuts peak GPU memory by up to 94%, sustains real-time generation, and matches or surpasses full-cache accuracy, even in multi-turn dialogues. By dissolving the KV cache bottleneck without retraining or query knowledge, InfiniPot-V closes the gap for on-device streaming video assistants.
Problem

Research questions and friction points this paper is trying to address.

Compress KV cache for streaming video understanding
Enforce hard memory cap for edge devices
Maintain accuracy while reducing GPU memory usage
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free KV cache compression for streaming video
Temporal-axis Redundancy metric removes redundant tokens
Value-Norm ranking retains semantically significant tokens
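The two-stage pruning described above can be sketched in NumPy. This is a minimal illustration of the idea, not the paper's implementation: the function name `compress_kv`, the cosine-similarity redundancy score, and the even split of the excess between the TaR and VaN stages are all illustrative assumptions.

```python
import numpy as np

def compress_kv(keys, values, budget):
    """Toy two-stage KV pruning under a hard token budget.
    keys, values: (n_tokens, d) arrays in temporal order."""
    n = keys.shape[0]
    if n <= budget:
        return keys, values
    # TaR stage (assumed scoring): cosine similarity of each key to its
    # temporal predecessor; tokens near-identical to the previous one
    # are treated as temporally redundant and dropped first.
    unit = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + 1e-8)
    redundancy = np.empty(n)
    redundancy[0] = -1.0                         # first token: no predecessor
    redundancy[1:] = (unit[1:] * unit[:-1]).sum(axis=1)
    mid = budget + (n - budget) // 2             # intermediate size (assumption)
    keep = np.sort(np.argsort(redundancy)[:mid]) # least redundant, in order
    keys, values = keys[keep], values[keep]
    # VaN stage: rank survivors by value-vector L2 norm and keep the
    # top `budget`, preserving temporal order.
    vnorm = np.linalg.norm(values, axis=1)
    keep = np.sort(np.argsort(vnorm)[-budget:])
    return keys[keep], values[keep]

# Streaming loop: the cache can never exceed the budget between chunks.
rng = np.random.default_rng(0)
budget, d = 64, 16
K, V = np.empty((0, d)), np.empty((0, d))
for _ in range(10):                              # 10 incoming video chunks
    K = np.vstack([K, rng.normal(size=(32, d))])
    V = np.vstack([V, rng.normal(size=(32, d))])
    if K.shape[0] > budget:
        K, V = compress_kv(K, V, budget)
```

Because compression runs whenever the threshold is crossed, peak cache size is bounded by the budget plus one chunk, independent of stream length.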