QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design

📅 2025-05-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Long-video understanding faces two real-time bottlenecks: slow video decoding (up to a minute for hour-long inputs) and high LLM prefill overhead (up to several million tokens). This paper proposes a system-algorithm co-design framework with three techniques: (1) parallel keyframe-aligned decoding for efficient, asynchronous CPU-side decoding; (2) semantic-importance-guided KV cache pruning and memory-aware prefill; and (3) CPU-GPU cross-device pipelined scheduling to overlap decoding, prefill, and inference. Evaluated on resource-constrained hardware, the approach reduces end-to-end inference latency on long videos by about a minute, which makes high-quality understanding of hour-long videos more practical for applications such as surveillance analytics and meeting summarization.
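As a rough illustration of score-based KV cache pruning, the sketch below keeps only the highest-scoring cached tokens and preserves their original order. It is a minimal sketch under stated assumptions, not the paper's method: the function name `prune_kv_cache`, the `keep_ratio` parameter, and the per-token `scores` are hypothetical, and how importance is actually scored is not specified here.

```python
def prune_kv_cache(keys, values, scores, keep_ratio=0.5):
    """Keep the top `keep_ratio` fraction of cached tokens by score.

    `keys`, `values`, and `scores` are parallel lists with one entry per token.
    All names here are illustrative assumptions, not an API from the paper.
    """
    if not (len(keys) == len(values) == len(scores)):
        raise ValueError("keys, values, and scores must have the same length")
    if not 0 < keep_ratio <= 1:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not keys:
        return [], [], []

    # Always keep at least one token so the cache never becomes empty.
    keep = max(1, int(len(keys) * keep_ratio))

    # Rank token positions by score, highest first, and take the top `keep`.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    # Restore the original token order for the retained positions.
    kept = sorted(ranked[:keep])
    return [keys[i] for i in kept], [values[i] for i in kept], kept


pruned_keys, pruned_values, kept_positions = prune_kv_cache(
    ["k0", "k1", "k2", "k3"],
    ["v0", "v1", "v2", "v3"],
    [0.1, 0.9, 0.4, 0.8],
    keep_ratio=0.5,
)
print(kept_positions)  # positions of the tokens that survive pruning
```

Halving the cache this way roughly halves the memory held for those tokens, which is the kind of trade-off that lets more frames fit in the same GPU memory.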

📝 Abstract

Long-video understanding has emerged as a crucial capability in real-world applications such as video surveillance, meeting summarization, educational lecture analysis, and sports broadcasting. However, it remains computationally prohibitive for VideoLLMs, primarily due to two bottlenecks: 1) sequential video decoding, the process of converting the raw bit stream to RGB frames, which can take up to a minute for hour-long video inputs, and 2) costly prefilling of up to several million tokens for LLM inference, resulting in high latency and memory use. To address these challenges, we propose QuickVideo, a system-algorithm co-design that substantially accelerates long-video understanding to support real-time downstream applications. It comprises three key innovations: QuickDecoder, a parallelized CPU-based video decoder that achieves a 2-3 times speedup by splitting videos into keyframe-aligned intervals processed concurrently; QuickPrefill, a memory-efficient prefilling method using KV-cache pruning to support more frames with less GPU memory; and an overlapping scheme that runs CPU video decoding concurrently with GPU inference. Together, these components reduce inference time by a minute on long video inputs, enabling scalable, high-quality video understanding even on limited hardware. Experiments show that QuickVideo generalizes across durations and sampling rates, making long video processing feasible in practice.
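The idea of splitting a video into keyframe-aligned intervals can be sketched as follows. This is a hedged illustration rather than QuickDecoder itself: `keyframe_intervals`, `fake_decode`, and `parallel_decode` are hypothetical names, the keyframe positions are assumed to be known in advance, and `fake_decode` is a stand-in that only returns frame indices instead of decoding real video.

```python
from concurrent.futures import ThreadPoolExecutor


def keyframe_intervals(keyframes, total_frames, num_chunks):
    """Split [0, total_frames) into half-open intervals that each start at a
    keyframe, so every interval can be decoded without earlier frames."""
    if not keyframes or keyframes[0] != 0:
        raise ValueError("expected a sorted keyframe list starting at frame 0")
    # Never ask for more intervals than there are keyframes.
    num_chunks = max(1, min(num_chunks, len(keyframes)))
    # Pick evenly spaced keyframes as interval starts.
    starts = [keyframes[i * len(keyframes) // num_chunks] for i in range(num_chunks)]
    ends = starts[1:] + [total_frames]
    return list(zip(starts, ends))


def fake_decode(interval):
    """Stand-in for a real decoder (assumption): lists the frames it would decode."""
    start, end = interval
    return list(range(start, end))


def parallel_decode(keyframes, total_frames, num_workers=4):
    """Decode all intervals concurrently and return frames in presentation order."""
    intervals = keyframe_intervals(keyframes, total_frames, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map yields results in input order, so no re-sorting is needed.
        chunks = list(pool.map(fake_decode, intervals))
    return [frame for chunk in chunks for frame in chunk]


intervals = keyframe_intervals([0, 30, 60, 90, 120, 150], total_frames=180, num_chunks=3)
frames = parallel_decode([0, 30, 60, 90, 120, 150], total_frames=180, num_workers=3)
print(intervals)
```

Aligning interval boundaries to keyframes matters because inter-coded frames depend on preceding frames back to the last keyframe; an interval that starts mid-group would need frames owned by another worker. Threads only help here if the underlying decoder releases the interpreter lock, which is an assumption about whatever decoder replaces `fake_decode`.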

Problem

Research questions and friction points this paper is trying to address.

Accelerate long-video understanding for real-time applications

Reduce computational bottlenecks in VideoLLM decoding and prefilling

Enable efficient processing on limited hardware resources

Innovation

Methods, ideas, or system contributions that make the work stand out.

Parallel CPU video decoding for speedup

KV-cache pruning for memory efficiency

Overlapping CPU decoding with GPU inference
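One common way to overlap CPU-side decoding with GPU-side work is a bounded producer-consumer queue, sketched below. This is an assumed design for illustration, not the paper's scheduler: `run_pipeline`, `decode`, `prefill`, and `max_pending` are hypothetical names, and the two callables stand in for real decoding and prefill stages.

```python
import queue
import threading


def run_pipeline(chunks, decode, prefill, max_pending=2):
    """Run `decode` in a background thread while `prefill` consumes its output.

    The bounded queue caps how many decoded chunks wait in memory at once.
    """
    pending = queue.Queue(maxsize=max_pending)
    done = object()  # sentinel marking the end of the stream

    def producer():
        try:
            for chunk in chunks:
                # Blocks when the queue is full, which throttles the decoder.
                pending.put(decode(chunk))
        finally:
            # Always signal completion so the consumer cannot hang,
            # even if decode raised part-way through.
            pending.put(done)

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()

    results = []
    while True:
        item = pending.get()
        if item is done:
            break
        # While this chunk is processed, the producer decodes the next one.
        results.append(prefill(item))
    worker.join()
    return results


outputs = run_pipeline([1, 2, 3], decode=lambda c: c * 10, prefill=lambda f: f + 1)
print(outputs)
```

With this structure the total time approaches the slower of the two stages rather than their sum, provided neither stage holds the interpreter lock for long.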

🔎 Similar Papers

Chrono: A Simple Blueprint for Representing Time in MLLMs