🤖 AI Summary
Streaming video vision-language models face dual bottlenecks: causality (no access to future frames) and accumulation (unbounded growth of visual tokens and the KV cache). Existing approaches optimize only the post-LLM cache, neglecting pre-LLM visual-token explosion. This paper proposes StreamingTOM, a training-free, plug-and-play two-stage framework that is the first to jointly address both bottlenecks: (1) Causal Temporal Reduction (CTR), which selects a fixed budget of visual tokens per frame based on inter-frame change and token saliency; and (2) Online Quantized Memory, which stores tokens in 4-bit form and dequantizes relevant groups on demand. The method achieves 15.7× KV-cache compression, 1.2× lower peak memory, and a 2× speedup in time-to-first-token (TTFT). It attains 55.8% accuracy / 3.7 score on the RVS benchmark and 63.8% average accuracy on offline benchmarks, state-of-the-art among training-free methods.
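A minimal sketch of the per-frame selection idea behind Causal Temporal Reduction, in PyTorch. The scoring here (an L2 inter-frame change term blended with a norm-based saliency proxy via a hypothetical `alpha` weight) is an illustrative assumption; the paper's exact change and saliency measures are not reproduced:

```python
import torch

def causal_temporal_reduction(curr_tokens, prev_tokens, budget, alpha=0.5):
    """Select a fixed `budget` of visual tokens for the current frame.

    curr_tokens, prev_tokens: (N, D) visual tokens of the current and
    previous frame (assumes aligned token grids across frames).
    """
    # Inter-frame change: tokens that differ most from the previous frame.
    change = (curr_tokens - prev_tokens).norm(dim=-1)        # (N,)
    # Saliency: a simple magnitude-based proxy (an assumption).
    saliency = curr_tokens.norm(dim=-1)                      # (N,)
    # Blend the two cues; alpha trades change against saliency.
    score = alpha * change + (1 - alpha) * saliency
    # Keep only the top-`budget` tokens, preserving spatial order.
    keep = score.topk(budget).indices.sort().values
    return curr_tokens[keep], keep
```

With a fixed `budget`, per-frame prefill cost is constant regardless of how many visual tokens the encoder emits, which is what makes the latency predictable.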
📝 Abstract
Unlike offline processing, streaming video vision-language models face two fundamental constraints: causality and accumulation. Causality prevents access to future frames that offline methods exploit, while accumulation causes tokens to grow unbounded, creating efficiency bottlenecks. However, existing approaches only regulate the post-LLM kv-cache, leaving costly pre-LLM prefill unchanged. We introduce StreamingTOM, a training-free, plug-and-play two-stage framework that addresses both pre-LLM and post-LLM bottlenecks with predictable latency. Causal Temporal Reduction imposes a fixed per-frame budget and selects tokens based on adjacent-frame changes and token saliency, drastically reducing prefill cost by processing only a compact subset of visual tokens per frame instead of all of them. Online Quantized Memory stores tokens in 4-bit format, retrieves relevant groups on demand, and dequantizes them, keeping the active kv-cache bounded regardless of stream length. Experiments demonstrate our method achieves $15.7\times$ kv-cache compression, $1.2\times$ lower peak memory, and $2\times$ faster TTFT compared to prior SOTA. StreamingTOM maintains state-of-the-art accuracy among training-free methods with an average of $63.8\%$ on offline benchmarks and $55.8\%/3.7$ on RVS. These results highlight the practical benefits of our two-stage approach for efficient streaming video understanding with bounded growth.
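And a sketch of how Online Quantized Memory might keep the active kv-cache bounded: token groups are quantized to 4-bit on write (two values packed per byte) and only the groups retrieved for a query are dequantized on read. The group-wise min-max quantizer and the mean-token retrieval key are assumptions for illustration, not the paper's exact scheme:

```python
import torch

class OnlineQuantizedMemory:
    """Sketch: store token groups as packed 4-bit integers; dequantize
    only the groups retrieved for the current query."""

    def __init__(self):
        # Each entry: (packed nibbles, scale, zero offset, retrieval key, shape)
        self.groups = []

    def write(self, tokens: torch.Tensor):
        # Affine min-max quantization of the group to 4-bit values in [0, 15]
        # (an assumed scheme; the paper's quantizer may differ).
        lo, hi = tokens.min(), tokens.max()
        scale = (hi - lo).clamp(min=1e-8) / 15
        q = ((tokens - lo) / scale).round().clamp(0, 15).to(torch.uint8).flatten()
        if q.numel() % 2:                     # pad to an even count
            q = torch.cat([q, q.new_zeros(1)])
        packed = (q[0::2] << 4) | q[1::2]     # two 4-bit values per byte
        # Hypothetical retrieval key: the group's mean token.
        self.groups.append((packed, scale, lo, tokens.mean(dim=0), tokens.shape))

    def read(self, query: torch.Tensor, top_k: int = 2):
        # Score groups against the query, then dequantize only the winners,
        # so the active full-precision cache stays bounded.
        keys = torch.stack([g[3] for g in self.groups])
        sims = torch.cosine_similarity(query.unsqueeze(0), keys, dim=-1)
        out = []
        for i in sims.topk(min(top_k, len(self.groups))).indices.tolist():
            packed, scale, lo, _, shape = self.groups[i]
            nibbles = torch.stack([packed >> 4, packed & 0xF], dim=1).flatten()
            n = shape[0] * shape[1]
            out.append((nibbles[:n].float() * scale + lo).reshape(shape))
        return out
```

A streaming loop would then `write` each frame's CTR-selected tokens as one group and `read` the top-k groups when a question arrives, dequantizing only those into the active kv-cache.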