🤖 AI Summary
This work addresses the challenges of high computational overhead and real-time processing in video stream analysis for vision-language model services. The authors propose an end-to-end online optimization framework that, for the first time, leverages metadata naturally generated during video decoding as a low-cost runtime signal to jointly guide video decoding, Vision Transformer (ViT) patch pruning, and selective refresh of large language model key-value (KV) caches, without requiring offline training. By integrating metadata-driven online patch pruning, selective KV cache updates, and compressed bitstream passthrough, the method achieves up to 3x higher throughput and an 87% reduction in GPU compute cost over the best existing baseline, while keeping F1 scores within 0-8% of the uncompressed pipeline.
📄 Abstract
Video streaming analytics is a crucial workload for vision-language model serving, but the high cost of multimodal inference limits scalability. Prior systems reduce inference cost by exploiting temporal and spatial redundancy in video streams, but they target either the vision transformer (ViT) or the LLM with a limited view, leaving end-to-end opportunities untapped. Moreover, existing methods incur significant overhead to identify redundancy, either through offline profiling and training or costly online computation, making them ill-suited for dynamic real-time streams.
We present CoStream, a codec-guided streaming video analytics system built on a key observation that video codecs already extract the temporal and spatial structure of each stream as a byproduct of compression. CoStream treats this codec metadata as a low-cost runtime signal to unify optimization across video decoding, visual processing, and LLM prefilling, with transmission reduction as an inherent benefit of operating directly on compressed bitstreams. This drives codec-guided patch pruning before ViT encoding and selective key-value cache refresh during LLM prefilling, both of which are fully online and do not require offline training. Experiments show that CoStream achieves up to 3x throughput improvement and up to 87% GPU compute reduction over state-of-the-art baselines, while maintaining competitive accuracy with only 0-8% F1 drop.
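The two online mechanisms above can be sketched concretely. The following is a minimal illustration, not CoStream's actual implementation: it assumes the decoder exposes per-patch motion magnitude and residual energy (real codecs expose motion vectors and residual coefficients per macroblock, which would need to be mapped to ViT patch granularity), and that the KV cache is addressable by patch-token index. The thresholds `tau_m` and `tau_r` are hypothetical knobs, not values from the paper.

```python
import numpy as np

def prune_patches(motion_mag, residual_energy, tau_m=0.5, tau_r=0.1):
    """Return indices of patches that changed enough to re-encode with the ViT.

    motion_mag, residual_energy: per-patch arrays derived from codec
    metadata (hypothetical interface). A patch is "stale" and skipped
    when both its motion and residual signals are below threshold.
    """
    changed = (motion_mag > tau_m) | (residual_energy > tau_r)
    return np.flatnonzero(changed)

def selective_kv_refresh(kv_cache, new_kv, refresh_idx):
    """Overwrite only the KV entries for changed patch tokens,
    reusing cached keys/values for all unchanged patches."""
    kv_cache[refresh_idx] = new_kv
    return kv_cache
```

In this sketch, frames whose metadata indicates little change trigger few or no ViT forward passes and touch only a small slice of the KV cache, which is the source of the compute savings the abstract describes.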