CoStream: Codec-Guided Resource-Efficient System for Video Streaming Analytics

📅 2026-04-07
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the challenges of high computational overhead and real-time processing in video stream analysis for vision-language model services. The authors propose an end-to-end online optimization framework that, for the first time, leverages metadata naturally generated during video decoding as a low-cost runtime signal to jointly guide video decoding, Vision Transformer (ViT) patch pruning, and selective refresh of large language model key-value (KV) caches, all without requiring offline training. By integrating metadata-driven online patch pruning, selective KV cache updates, and compressed bitstream passthrough, the method achieves up to 3× higher throughput and an 87% reduction in GPU compute cost compared to the best existing baseline, while limiting F1 degradation to 0–8%.
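The codec-metadata-guided patch pruning described in the summary can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not CoStream's actual implementation: the `select_patches` function, the per-patch motion/residual inputs, and the thresholds are all hypothetical, standing in for whatever statistics the decoder exposes per macroblock.

```python
# Hedged sketch: decide which ViT patches to re-encode using codec metadata.
# Function name, inputs, and thresholds are illustrative assumptions.

def select_patches(motion_mags, residual_energies,
                   mv_thresh=1.0, res_thresh=4.0):
    """Return indices of patches that changed enough to re-encode.

    motion_mags[i]       -- motion-vector magnitude for patch i (from the codec)
    residual_energies[i] -- residual energy for patch i (from the codec)

    Patches below both thresholds are treated as temporally redundant:
    their cached ViT embeddings (and the matching LLM KV-cache entries)
    can be reused instead of recomputed.
    """
    return [i for i, (mv, res) in enumerate(zip(motion_mags, residual_energies))
            if mv >= mv_thresh or res >= res_thresh]

# Example frame: patch 1 moved, patch 3 has a large residual; 0 and 2 are static.
changed = select_patches([0.0, 2.5, 0.3, 0.1], [1.0, 0.5, 2.0, 9.0])
# -> [1, 3]
```

Because the motion vectors and residuals are byproducts of decoding, this selection costs essentially nothing at runtime, which is the key to avoiding offline profiling.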
๐Ÿ“ Abstract
Video streaming analytics is a crucial workload for vision-language model serving, but the high cost of multimodal inference limits scalability. Prior systems reduce inference cost by exploiting temporal and spatial redundancy in video streams, but they target either the vision transformer (ViT) or the LLM with a limited view, leaving end-to-end opportunities untapped. Moreover, existing methods incur significant overhead to identify redundancy, either through offline profiling and training or costly online computation, making them ill-suited for dynamic real-time streams. We present CoStream, a codec-guided streaming video analytics system built on a key observation that video codecs already extract the temporal and spatial structure of each stream as a byproduct of compression. CoStream treats this codec metadata as a low-cost runtime signal to unify optimization across video decoding, visual processing, and LLM prefilling, with transmission reduction as an inherent benefit of operating directly on compressed bitstreams. This drives codec-guided patch pruning before ViT encoding and selective key-value cache refresh during LLM prefilling, both of which are fully online and do not require offline training. Experiments show that CoStream achieves up to 3x throughput improvement and up to 87% GPU compute reduction over state-of-the-art baselines, while maintaining competitive accuracy with only 0-8% F1 drop.
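The selective KV cache refresh mentioned in the abstract might look roughly like the sketch below: only the cache entries for token positions whose underlying patches changed are recomputed, and everything else is reused. `refresh_kv` and `compute_kv` are hypothetical stand-ins, not CoStream's API.

```python
# Hedged sketch: selective KV-cache refresh during LLM prefilling.
# kv_cache maps token position -> (key, value); compute_kv stands in for
# the model's per-token key/value projection. All names are illustrative.

def refresh_kv(kv_cache, patch_tokens, changed_positions, compute_kv):
    """Recompute KV entries only for positions whose patch changed;
    cached entries at all other positions are left untouched."""
    for pos in changed_positions:
        kv_cache[pos] = compute_kv(patch_tokens[pos])
    return kv_cache

# Toy usage: pretend the key is the token and the value is twice the token.
compute_kv = lambda t: (t, t * 2)
cache = {0: (1, 2), 1: (9, 9), 2: (3, 6)}
cache = refresh_kv(cache, [1, 5, 3], changed_positions=[1],
                   compute_kv=compute_kv)
# Only position 1 is refreshed: {0: (1, 2), 1: (5, 10), 2: (3, 6)}
```

Skipping unchanged positions is what turns the codec's redundancy signal into prefill savings: the fraction of recomputed KV entries scales with how much of the frame actually changed.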
Problem

Research questions and friction points this paper is trying to address.

video streaming analytics
multimodal inference
temporal and spatial redundancy
real-time streams
vision-language model
Innovation

Methods, ideas, or system contributions that make the work stand out.

codec-guided optimization
video streaming analytics
vision-language models
online redundancy reduction
compressed-domain processing
👥 Authors
Yulin Zou (Nanyang Technological University)
Yan Chen (Beihang University)
Wenyan Chen (Nanyang Technological University)
JooYoung Park (Nanyang Technological University)
Shivaraman Nitin (Institute of High Performance Computing)
Luo Tao (Institute of High Performance Computing)
Francisco Romero (Georgia Institute of Technology)
Dmitrii Ustiugov (NTU Singapore)