🤖 AI Summary
To address the linear growth of KV caches with video duration in real-time video understanding on edge devices, which causes out-of-memory failures, this paper proposes the first training-free, query-agnostic streaming KV compression framework. Our method introduces two synergistic mechanisms, Temporal-axis Redundancy (TaR) detection and Value-Norm (VaN) ranking for semantic preservation, enabling dynamic cache pruning under a strict, fixed memory budget. We further design a lightweight online compression pathway that supports plug-and-play integration across diverse vision-language models. Experiments demonstrate up to a 94% reduction in peak GPU memory while preserving real-time generation capability. On long-video and streaming benchmarks, our approach matches or surpasses full-cache baselines in accuracy and supports multi-turn dialogue. Crucially, it is the first to enforce a hard memory constraint that is strictly independent of video length, guaranteeing bounded memory usage regardless of input duration.
📝 Abstract
Modern multimodal large language models (MLLMs) can reason over hour-long videos, yet their key-value (KV) cache grows linearly with time, quickly exceeding the fixed memory of phones, AR glasses, and edge robots. Prior compression schemes either assume the whole video and user query are available offline or must first build the full cache, so memory still scales with stream length. InfiniPot-V is the first training-free, query-agnostic framework that enforces a hard, length-independent memory cap for streaming video understanding. During video encoding it monitors the cache and, once a user-set threshold is reached, runs a lightweight compression pass that (i) removes temporally redundant tokens via a Temporal-axis Redundancy (TaR) metric and (ii) keeps semantically significant tokens via Value-Norm (VaN) ranking. Across four open-source MLLMs, four long-video benchmarks, and two streaming-video benchmarks, InfiniPot-V cuts peak GPU memory by up to 94%, sustains real-time generation, and matches or surpasses full-cache accuracy, even in multi-turn dialogues. By dissolving the KV cache bottleneck without retraining or query knowledge, InfiniPot-V closes the gap for on-device streaming video assistants.
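The compression pass described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the exact TaR and VaN formulas are assumptions here (TaR is modeled as cosine similarity between each token's key and its temporal predecessor, VaN as the L2 norm of value vectors), and the function names, the `tar_keep_ratio` parameter, and the toy dimensions are invented for the example.

```python
import numpy as np

def compress_kv_cache(keys, values, budget, tar_keep_ratio=2.0):
    """One compression pass in the spirit of InfiniPot-V (forms assumed).

    keys, values: (T, d) arrays of cached video-token KV pairs.
    budget: hard cap on the number of tokens kept.
    """
    T, _ = keys.shape
    if T <= budget:
        return keys, values
    # TaR (assumed form): cosine similarity between each token's key and its
    # temporal predecessor; a high score marks the token as redundant.
    a, b = keys[:-1], keys[1:]
    cos = np.einsum("td,td->t", a, b) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8)
    redundancy = np.concatenate(([-1.0], cos))  # first token never redundant
    # Keep the least-redundant tokens as TaR survivors (a multiple of budget).
    n_tar = min(T, int(budget * tar_keep_ratio))
    survivors = np.argsort(redundancy)[:n_tar]
    # VaN (assumed form): rank survivors by value-vector L2 norm; larger norms
    # are treated as more semantically significant.
    vnorm = np.linalg.norm(values[survivors], axis=1)
    keep = survivors[np.argsort(-vnorm)[:budget]]
    keep.sort()  # restore temporal order of the retained tokens
    return keys[keep], values[keep]

# Streaming usage: compress whenever the cache crosses the user-set cap, so
# memory stays bounded no matter how long the stream runs.
rng = np.random.default_rng(0)
K = V = None
for _ in range(8):  # 8 incoming chunks of 64 video tokens each
    k, v = rng.normal(size=(64, 128)), rng.normal(size=(64, 128))
    K = k if K is None else np.vstack([K, k])
    V = v if V is None else np.vstack([V, v])
    if len(K) > 256:  # hard cap, independent of stream length
        K, V = compress_kv_cache(K, V, budget=256)
```

Note the cap is enforced per pass: the cache briefly holds one extra chunk, is pruned back to the budget, and never grows with total video duration.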