🤖 AI Summary
This work addresses the challenges of high computational overhead and real-time processing in video stream analysis for vision-language model services. The authors propose an end-to-end online optimization framework that, for the first time, leverages metadata naturally generated during video decoding as a low-cost runtime signal to jointly guide video decoding, Vision Transformer (ViT) patch pruning, and selective refresh of large language model key-value (KV) caches, without requiring offline training. By integrating metadata-driven online patch pruning, selective KV cache updates, and compressed bitstream passthrough, the method achieves up to 3x higher throughput and an 87% reduction in GPU compute cost over the best existing baseline, while keeping F1 scores within 0-8% of the uncompressed pipeline.
📄 Abstract
Video streaming analytics is a crucial workload for vision-language model serving, but the high cost of multimodal inference limits scalability. Prior systems reduce inference cost by exploiting temporal and spatial redundancy in video streams, but they target either the vision transformer (ViT) or the LLM with a limited view, leaving end-to-end opportunities untapped. Moreover, existing methods incur significant overhead to identify redundancy, either through offline profiling and training or costly online computation, making them ill-suited for dynamic real-time streams.
We present CoStream, a codec-guided streaming video analytics system built on a key observation that video codecs already extract the temporal and spatial structure of each stream as a byproduct of compression. CoStream treats this codec metadata as a low-cost runtime signal to unify optimization across video decoding, visual processing, and LLM prefilling, with transmission reduction as an inherent benefit of operating directly on compressed bitstreams. This drives codec-guided patch pruning before ViT encoding and selective key-value cache refresh during LLM prefilling, both of which are fully online and do not require offline training. Experiments show that CoStream achieves up to 3x throughput improvement and up to 87% GPU compute reduction over state-of-the-art baselines, while maintaining competitive accuracy with only 0-8% F1 drop.
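The two online mechanisms above can be sketched concretely. The following is a minimal illustration, not CoStream's actual implementation: it assumes the decoder exposes per-patch motion magnitude and residual energy (real codecs expose motion vectors and residual coefficients per macroblock, which would need to be mapped to ViT patch granularity), and that the KV cache is addressable by patch-token index. The thresholds `tau_m` and `tau_r` are hypothetical knobs, not values from the paper.

```python
import numpy as np

def prune_patches(motion_mag, residual_energy, tau_m=0.5, tau_r=0.1):
    """Return indices of patches that changed enough to re-encode with the ViT.

    motion_mag, residual_energy: per-patch arrays derived from codec
    metadata (hypothetical interface). A patch is "stale" and skipped
    when both its motion and residual signals are below threshold.
    """
    changed = (motion_mag > tau_m) | (residual_energy > tau_r)
    return np.flatnonzero(changed)

def selective_kv_refresh(kv_cache, new_kv, refresh_idx):
    """Overwrite only the KV entries for changed patch tokens,
    reusing cached keys/values for all unchanged patches."""
    kv_cache[refresh_idx] = new_kv
    return kv_cache
```

In this sketch, frames whose metadata indicates little change trigger few or no ViT forward passes and touch only a small slice of the KV cache, which is the source of the compute savings the abstract describes.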