Accelerating Streaming Video Large Language Models via Hierarchical Token Compression

📅 2025-11-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Video Large Language Models (VideoLLMs) for streaming video face significant real-time deployment bottlenecks due to high computational overhead and latency in both visual encoding and LLM pre-filling, caused primarily by redundant temporal frame processing in Vision Transformers (ViTs) and token explosion from long sequences. To address this, we propose STC, a hierarchical compression framework featuring a novel two-stage acceleration mechanism: STC-Cacher caches ViT features and reuses historical frame representations via inter-frame similarity matching; STC-Pruner dynamically prunes non-critical visual tokens based on spatiotemporal saliency assessment. STC jointly optimizes preprocessing and input sequence construction, enabling end-to-end inference load reduction. Experiments across five benchmarks demonstrate up to a 24.5% reduction in visual encoding latency and up to a 45.3% reduction in pre-filling latency, while preserving 99% of the original accuracy.
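The caching idea behind STC-Cacher can be illustrated with a short sketch: reuse cached ViT features for patches of the new frame that closely match the previous frame, and re-encode only the patches that changed. The function names, the per-patch cosine-similarity criterion, and the threshold below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def encode_with_cache(frame_patches, prev_patches, prev_features,
                      vit_encode, sim_threshold=0.95):
    """Hypothetical STC-Cacher-style feature reuse.

    frame_patches: (N, D) raw patch vectors of the current frame.
    prev_patches / prev_features: patches and ViT features of the last frame.
    vit_encode: callable mapping an (M, D) patch batch to (M, F) features.
    """
    if prev_patches is None:                      # first frame: full encode
        return vit_encode(frame_patches)
    # Cosine similarity between corresponding patches of the two frames.
    num = (frame_patches * prev_patches).sum(axis=1)
    den = (np.linalg.norm(frame_patches, axis=1)
           * np.linalg.norm(prev_patches, axis=1) + 1e-8)
    sim = num / den
    reuse = sim >= sim_threshold                  # patches safe to reuse
    features = np.empty_like(prev_features)
    features[reuse] = prev_features[reuse]        # cache hit: copy old feature
    if (~reuse).any():                            # cache miss: re-encode
        features[~reuse] = vit_encode(frame_patches[~reuse])
    return features
```

Because temporally adjacent streaming frames are highly similar, most patches take the cache-hit path, which is where the reported ViT encoding latency savings would come from.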

📝 Abstract
Streaming Video Large Language Models (VideoLLMs) have demonstrated impressive performance across various video understanding tasks, but they face significant challenges in real-time deployment due to the high computational cost of processing dense visual tokens from continuous video streams. In streaming video scenarios, the primary bottleneck lies in the Vision Transformer (ViT) encoding stage, where redundant processing of temporally similar frames leads to inefficiency. Additionally, inflated token sequences during LLM pre-filling further exacerbate latency and memory overhead. To address these challenges, we propose Streaming Token Compression (STC), a plug-and-play hierarchical framework that seamlessly integrates into existing streaming VideoLLMs, optimizing both the ViT encoding and LLM pre-filling stages to accelerate processing. STC introduces two token-level accelerators: STC-Cacher, which reduces ViT encoding overhead by caching and reusing features from temporally similar frames, and STC-Pruner, which compresses the visual token sequence before it enters the LLM, preserving only the most salient tokens based on both spatial and temporal relevance. Extensive experiments on four baseline streaming VideoLLMs across five benchmarks demonstrate that STC outperforms other compression methods. Notably, STC retains up to 99% of accuracy on the ReKV framework while reducing ViT encoding latency and LLM pre-filling latency by 24.5% and 45.3%, respectively.
Problem

Research questions and friction points this paper is trying to address.

Reduces computational cost of dense visual tokens in streaming VideoLLMs
Minimizes redundant ViT encoding for temporally similar video frames
Compresses token sequences to lower LLM pre-filling latency and memory overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical token compression reduces computational cost
Caching and reusing features from similar frames
Pruning visual tokens based on spatiotemporal relevance
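The pruning idea in the last bullet can be sketched as a top-k selection over a combined spatiotemporal saliency score. The scoring recipe below (feature-norm magnitude mixed with change relative to the previous frame) and the `keep_ratio`/`alpha` parameters are assumptions for illustration, not the paper's actual STC-Pruner criterion.

```python
import numpy as np

def prune_tokens(tokens, prev_tokens, keep_ratio=0.5, alpha=0.5):
    """Hypothetical STC-Pruner-style token selection.

    tokens: (N, F) visual tokens of the current frame.
    prev_tokens: (N, F) tokens of the previous frame, or None.
    Returns the kept tokens and their original indices (in order).
    """
    spatial = np.linalg.norm(tokens, axis=1)           # spatial saliency proxy
    if prev_tokens is None:
        temporal = np.zeros(len(tokens))
    else:                                              # temporal change proxy
        temporal = np.linalg.norm(tokens - prev_tokens, axis=1)

    def norm01(x):                                     # rescale to [0, 1]
        rng = x.max() - x.min()
        return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)

    score = alpha * norm01(spatial) + (1 - alpha) * norm01(temporal)
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.sort(np.argsort(score)[-k:])             # top-k, original order
    return tokens[keep], keep
```

Shrinking the token sequence before it enters the LLM is what would cut pre-filling latency and KV-cache memory, since both scale with sequence length.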