🤖 AI Summary
Streaming video vision-language models face dual bottlenecks: causality (no access to future frames) and accumulation (unbounded growth of visual tokens and the KV cache). Existing approaches optimize only the post-LLM cache, neglecting pre-LLM visual-token explosion. This paper proposes StreamingTOM, a training-free, plug-and-play two-stage framework that is the first to jointly address both bottlenecks: (1) Causal Temporal Reduction (CTR), which selects a fixed budget of visual tokens per frame based on inter-frame change and token saliency; and (2) Online Quantized Memory, which stores tokens in 4-bit form and dequantizes relevant groups on demand. The method achieves 15.7× KV-cache compression, 1.2× lower peak memory, and a 2× speedup in time-to-first-token (TTFT). It attains 55.8% accuracy / 3.7 score on the RVS benchmark and 63.8% average accuracy on offline benchmarks, state-of-the-art among training-free methods.
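A minimal sketch of the per-frame selection idea behind Causal Temporal Reduction, in PyTorch. The scoring here (an L2 inter-frame change term blended with a norm-based saliency proxy via a hypothetical `alpha` weight) is an illustrative assumption; the paper's exact change and saliency measures are not reproduced:

```python
import torch

def causal_temporal_reduction(curr_tokens, prev_tokens, budget, alpha=0.5):
    """Select a fixed `budget` of visual tokens for the current frame.

    curr_tokens, prev_tokens: (N, D) visual tokens of the current and
    previous frame (assumes aligned token grids across frames).
    """
    # Inter-frame change: tokens that differ most from the previous frame.
    change = (curr_tokens - prev_tokens).norm(dim=-1)        # (N,)
    # Saliency: a simple magnitude-based proxy (an assumption).
    saliency = curr_tokens.norm(dim=-1)                      # (N,)
    # Blend the two cues; alpha trades change against saliency.
    score = alpha * change + (1 - alpha) * saliency
    # Keep only the top-`budget` tokens, preserving spatial order.
    keep = score.topk(budget).indices.sort().values
    return curr_tokens[keep], keep
```

With a fixed `budget`, per-frame prefill cost is constant regardless of how many visual tokens the encoder emits, which is what makes the latency predictable.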
📝 Abstract
Unlike offline processing, streaming video vision-language models face two fundamental constraints: causality and accumulation. Causality prevents access to future frames that offline methods exploit, while accumulation causes tokens to grow unbounded, creating efficiency bottlenecks. However, existing approaches only regulate the post-LLM kv-cache, leaving costly pre-LLM prefill unchanged. We introduce StreamingTOM, a training-free, plug-and-play two-stage framework that addresses both pre-LLM and post-LLM bottlenecks with predictable latency. Causal Temporal Reduction imposes a fixed per-frame budget and selects tokens based on adjacent-frame changes and token saliency, drastically reducing prefill cost by processing only a compact subset of visual tokens per frame instead of all of them. Online Quantized Memory stores tokens in 4-bit format, retrieves relevant groups on demand, and dequantizes them, keeping the active kv-cache bounded regardless of stream length. Experiments demonstrate our method achieves $15.7\times$ kv-cache compression, $1.2\times$ lower peak memory, and $2\times$ faster TTFT compared to prior SOTA. StreamingTOM maintains state-of-the-art accuracy among training-free methods with an average of $63.8\%$ on offline benchmarks and $55.8\%/3.7$ on RVS. These results highlight the practical benefits of our two-stage approach for efficient streaming video understanding with bounded growth.
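And a sketch of how Online Quantized Memory might keep the active kv-cache bounded: token groups are quantized to 4-bit on write (two values packed per byte) and only the groups retrieved for a query are dequantized on read. The group-wise min-max quantizer and the mean-token retrieval key are assumptions for illustration, not the paper's exact scheme:

```python
import torch

class OnlineQuantizedMemory:
    """Sketch: store token groups as packed 4-bit integers; dequantize
    only the groups retrieved for the current query."""

    def __init__(self):
        # Each entry: (packed nibbles, scale, zero offset, retrieval key, shape)
        self.groups = []

    def write(self, tokens: torch.Tensor):
        # Affine min-max quantization of the group to 4-bit values in [0, 15]
        # (an assumed scheme; the paper's quantizer may differ).
        lo, hi = tokens.min(), tokens.max()
        scale = (hi - lo).clamp(min=1e-8) / 15
        q = ((tokens - lo) / scale).round().clamp(0, 15).to(torch.uint8).flatten()
        if q.numel() % 2:                     # pad to an even count
            q = torch.cat([q, q.new_zeros(1)])
        packed = (q[0::2] << 4) | q[1::2]     # two 4-bit values per byte
        # Hypothetical retrieval key: the group's mean token.
        self.groups.append((packed, scale, lo, tokens.mean(dim=0), tokens.shape))

    def read(self, query: torch.Tensor, top_k: int = 2):
        # Score groups against the query, then dequantize only the winners,
        # so the active full-precision cache stays bounded.
        keys = torch.stack([g[3] for g in self.groups])
        sims = torch.cosine_similarity(query.unsqueeze(0), keys, dim=-1)
        out = []
        for i in sims.topk(min(top_k, len(self.groups))).indices.tolist():
            packed, scale, lo, _, shape = self.groups[i]
            nibbles = torch.stack([packed >> 4, packed & 0xF], dim=1).flatten()
            n = shape[0] * shape[1]
            out.append((nibbles[:n].float() * scale + lo).reshape(shape))
        return out
```

A streaming loop would then `write` each frame's CTR-selected tokens as one group and `read` the top-k groups when a question arrives, dequantizing only those into the active kv-cache.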