🤖 AI Summary
Video large language models (Video-LLMs) suffer from substantial inference latency when processing long videos due to their autoregressive decoding mechanism. To address this, we propose the Sparse-to-Dense (StD) decoding paradigm—a novel “fast-model speculation + slow-model parallel verification” co-design. Specifically, a lightweight fast model performs multi-step token speculation using top-K sparse attention, while a full-attention slow model concurrently verifies all candidate tokens in parallel. Critically, StD requires zero fine-tuning, is fully plug-and-play, and necessitates only minimal code modifications. Evaluated across mainstream Video-LLMs, StD achieves up to 1.94× end-to-end speedup without sacrificing original model accuracy. To our knowledge, it is the first training-free, general-purpose inference optimization framework for efficient long-video understanding—offering broad compatibility and immediate deployment utility.
📝 Abstract
Due to the auto-regressive nature of current video large language models (Video-LLMs), inference latency increases as the input sequence grows, posing challenges for the efficient processing of video sequences, which are usually very long. We observe that during decoding, the attention scores of most tokens in Video-LLMs are sparse and concentrated, with only certain tokens requiring comprehensive full attention. Based on this insight, we introduce Sparse-to-Dense (StD), a novel decoding strategy that integrates two distinct modules: one leveraging sparse top-K attention and the other employing dense full attention. These modules collaborate to accelerate Video-LLMs without quality loss: the fast (sparse) model speculatively decodes multiple tokens, while the slow (dense) model verifies them in parallel. StD is a tuning-free, plug-and-play solution that achieves up to a 1.94× wall-clock speedup in video processing. It maintains model performance while enabling a seamless transition from a standard Video-LLM to a sparse Video-LLM with minimal code modifications.
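The "fast model speculates, slow model verifies" loop described above follows the standard speculative-decoding pattern. Below is a minimal greedy sketch of that pattern; all names (`speculate_then_verify`, `fast_next`, `slow_next`, `gamma`) are hypothetical stand-ins, and the real StD method applies this inside a Video-LLM, where the draft model uses top-K sparse attention and the verifier uses full attention.

```python
# Hypothetical sketch of a greedy speculate-then-verify loop (not the
# paper's actual implementation). `fast_next` / `slow_next` stand in for
# the sparse-attention draft model and the full-attention verifier: each
# maps a token context to the model's next-token prediction.

def speculate_then_verify(prefix, fast_next, slow_next, gamma=4, max_new=16):
    """Generate up to `max_new` tokens after `prefix` using speculation."""
    tokens = list(prefix)
    while len(tokens) - len(prefix) < max_new:
        # 1) The fast (sparse) model drafts `gamma` tokens autoregressively.
        draft, ctx = [], list(tokens)
        for _ in range(gamma):
            t = fast_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) The slow (dense) model checks all draft positions. In a real
        #    LLM this is a single parallel forward pass over tokens+draft;
        #    here it is unrolled for clarity.
        accepted = 0
        for i in range(gamma):
            if slow_next(tokens + draft[:i]) == draft[i]:
                accepted += 1
            else:
                break
        tokens.extend(draft[:accepted])
        if accepted < gamma:
            # On the first mismatch, the verifier's own prediction is
            # already available, so it is emitted as a "free" token.
            tokens.append(slow_next(tokens))
    return tokens[len(prefix):len(prefix) + max_new]
```

When the draft and verifier agree, each verification pass accepts `gamma` tokens at the cost of roughly one slow-model step, which is the source of the speedup; accuracy is preserved because every emitted token is one the slow model would have produced itself.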