🤖 AI Summary
Long-video understanding faces two bottlenecks that prevent real-time use: slow video decoding (up to a minute for hour-long inputs) and prohibitively high LLM prefill overhead (up to several million tokens). This paper proposes a system-algorithm co-design framework built on three techniques: (1) parallel keyframe-aligned decoding for efficient, asynchronous CPU-side decoding; (2) semantic-importance-guided KV cache pruning and memory-aware prefill; and (3) CPU-GPU cross-device pipelined scheduling to overlap decoding, prefill, and inference. Evaluated on resource-constrained hardware, the approach reduces end-to-end inference latency on long videos by about a minute, bringing high-quality understanding of hour-long videos closer to real-time use. This makes long-video processing more practical in applications such as surveillance analytics and meeting summarization.
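The pipelined scheduling in item (3) can be illustrated with a minimal producer-consumer sketch. This is not the paper's implementation: `run_pipeline`, `decode`, `prefill`, and `max_pending` are hypothetical names, and the two callables stand in for a CPU decoder and a GPU prefill step. The only point is that decoding of the next chunk proceeds while the previous chunk is being consumed.

```python
import queue
import threading


def run_pipeline(chunks, decode, prefill, max_pending=2):
    """Overlap decode (producer thread) with prefill (caller's thread).

    `decode` and `prefill` are hypothetical stand-ins supplied by the caller.
    """
    pending = queue.Queue(maxsize=max_pending)  # bounded: caps buffered frames
    done = object()  # sentinel marking the end of the stream

    def producer():
        try:
            for chunk in chunks:
                pending.put(decode(chunk))  # blocks when the buffer is full
        finally:
            pending.put(done)  # always signal completion to the consumer

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()

    state = None
    while True:
        frames = pending.get()
        if frames is done:
            break
        state = prefill(state, frames)  # consume while the producer keeps decoding
    worker.join()
    return state
```

The bounded queue keeps the decoder from running arbitrarily far ahead of the consumer, which limits how many decoded frames sit in memory at once. Real overlap in Python depends on the decoder releasing the GIL (as native decoding libraries typically do); error propagation from the producer thread is omitted here for brevity.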
📝 Abstract
Long-video understanding has emerged as a crucial capability in real-world applications such as video surveillance, meeting summarization, educational lecture analysis, and sports broadcasting. However, it remains computationally prohibitive for VideoLLMs, primarily due to two bottlenecks: 1) sequential video decoding, the process of converting the raw bit stream to RGB frames, which can take up to a minute for hour-long video inputs, and 2) costly prefilling of up to several million tokens for LLM inference, resulting in high latency and memory use. To address these challenges, we propose QuickVideo, a system-algorithm co-design that substantially accelerates long-video understanding to support real-time downstream applications. It comprises three key innovations: QuickDecoder, a parallelized CPU-based video decoder that achieves a 2-3x speedup by splitting videos into keyframe-aligned intervals processed concurrently; QuickPrefill, a memory-efficient prefilling method using KV-cache pruning to support more frames with less GPU memory; and an overlapping scheme that runs CPU video decoding concurrently with GPU inference. Together, these components reduce inference time by a minute on long video inputs, enabling scalable, high-quality video understanding even on limited hardware. Experiments show that QuickVideo generalizes across durations and sampling rates, making long video processing feasible in practice.
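The keyframe-aligned splitting described for QuickDecoder can be sketched as follows. This is an illustrative sketch under stated assumptions, not QuickDecoder itself: the function names are hypothetical, the keyframe indices are assumed to be already known (for example, read from the container index), and `decode_interval` is a stand-in that only returns frame indices. Each interval starts at a keyframe, so a worker can begin decoding without depending on frames owned by another worker.

```python
from bisect import bisect_left
from concurrent.futures import ThreadPoolExecutor


def keyframe_aligned_intervals(keyframes, total_frames, num_workers):
    """Split [0, total_frames) into at most num_workers half-open intervals,
    each starting at a keyframe.

    `keyframes` is a sorted list of keyframe frame indices beginning with 0.
    """
    if not keyframes or keyframes[0] != 0:
        raise ValueError("expected a sorted keyframe list starting at frame 0")
    starts = [0]
    for w in range(1, num_workers):
        target = w * total_frames // num_workers  # ideal even split point
        i = bisect_left(keyframes, target)        # first keyframe >= target
        if i < len(keyframes) and keyframes[i] > starts[-1]:
            starts.append(keyframes[i])
    ends = starts[1:] + [total_frames]
    return list(zip(starts, ends))


def decode_interval(interval):
    """Stand-in for a real decoder: returns the frame indices it would decode."""
    start, end = interval
    return list(range(start, end))


def parallel_decode(keyframes, total_frames, num_workers=4):
    """Decode all intervals concurrently and concatenate them in order."""
    intervals = keyframe_aligned_intervals(keyframes, total_frames, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        chunks = list(pool.map(decode_interval, intervals))  # order preserved
    return [frame for chunk in chunks for frame in chunk]
```

When keyframes are sparse, fewer intervals than workers may be produced, since a split point is only placed where a keyframe exists. Interval lengths are therefore only approximately balanced.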