Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs

📅 2025-05-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Video large language models (Video-LLMs) suffer from substantial inference latency when processing long videos due to their autoregressive decoding mechanism. To address this, we propose the Sparse-to-Dense (StD) decoding paradigm—a novel “fast-model speculation + slow-model parallel verification” co-design. Specifically, a lightweight fast model performs multi-step token speculation using top-K sparse attention, while a full-attention slow model concurrently verifies all candidate tokens in parallel. Critically, StD requires zero fine-tuning, is fully plug-and-play, and necessitates only minimal code modifications. Evaluated across mainstream Video-LLMs, StD achieves up to 1.94× end-to-end speedup without sacrificing original model accuracy. To our knowledge, it is the first training-free, general-purpose inference optimization framework for efficient long-video understanding—offering broad compatibility and immediate deployment utility.
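The draft-then-verify loop described above is easiest to see in code. Below is a minimal, illustrative sketch of one round of greedy speculative decoding, not the paper's implementation: `draft_model`, `target_model`, and the default `gamma` are hypothetical, the models are assumed to be HuggingFace-style callables returning `.logits`, and KV caches are omitted for readability.

```python
import torch

@torch.no_grad()
def speculate_and_verify(draft_model, target_model, input_ids, gamma=4):
    """One round of greedy draft-then-verify decoding (illustrative only).

    draft_model:  fast model, e.g. one restricted to top-K sparse attention
    target_model: slow full-attention model used for parallel verification
    gamma:        number of tokens drafted per round (hypothetical default)
    Assumes batch size 1; a real implementation would reuse KV caches
    instead of recomputing the full prefix each call.
    """
    # 1) Draft `gamma` tokens autoregressively with the cheap model.
    draft_ids = input_ids
    for _ in range(gamma):
        logits = draft_model(draft_ids).logits[:, -1]
        next_id = logits.argmax(dim=-1, keepdim=True)   # greedy drafting
        draft_ids = torch.cat([draft_ids, next_id], dim=-1)

    # 2) Verify all drafted tokens in ONE parallel forward pass of the
    #    full-attention model.
    target_logits = target_model(draft_ids).logits
    n_prompt = input_ids.size(1)
    # The target's prediction at position t-1 must match the draft token at t.
    preds = target_logits[:, n_prompt - 1 : -1].argmax(dim=-1)
    drafted = draft_ids[:, n_prompt:]
    agree = (preds == drafted).long().cumprod(dim=-1)   # longest agreeing prefix
    n_accept = int(agree.sum())

    # 3) Keep the accepted prefix plus the target model's own next token,
    #    so every round emits at least one verified token.
    accepted = draft_ids[:, : n_prompt + n_accept]
    bonus = target_logits[:, n_prompt + n_accept - 1].argmax(dim=-1, keepdim=True)
    return torch.cat([accepted, bonus], dim=-1)
```

Because verification is a single batched forward pass over all drafted tokens, the slow model's cost is amortized: whenever the draft agrees with the target on several tokens, those tokens are emitted for roughly the price of one dense decoding step, which is the source of the reported lossless speedup.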

📝 Abstract
Due to the auto-regressive nature of current video large language models (Video-LLMs), the inference latency increases as the input sequence length grows, posing challenges for the efficient processing of video sequences that are usually very long. We observe that during decoding, the attention scores of most tokens in Video-LLMs tend to be sparse and concentrated, with only certain tokens requiring comprehensive full attention. Based on this insight, we introduce Sparse-to-Dense (StD), a novel decoding strategy that integrates two distinct modules: one leveraging sparse top-K attention and the other employing dense full attention. These modules collaborate to accelerate Video-LLMs without loss. The fast (sparse) model speculatively decodes multiple tokens, while the slow (dense) model verifies them in parallel. StD is a tuning-free, plug-and-play solution that achieves up to a 1.94× walltime speedup in video processing. It maintains model performance while enabling a seamless transition from a standard Video-LLM to a sparse Video-LLM with minimal code modifications.
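For intuition, the sparse module can be thought of as standard attention restricted to the top-K highest-scoring keys for each query, which is what makes the draft model cheap on long video contexts. The sketch below is a generic top-K sparse attention in PyTorch, not the paper's exact kernel; shapes follow the usual (batch, heads, length, head_dim) convention and `top_k` is an illustrative parameter.

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, top_k=64):
    """Attention restricted to the top-K highest-scoring keys per query.

    q:    (B, H, Q, d) queries (Q = 1 during autoregressive decoding)
    k, v: (B, H, T, d) cached keys/values; `top_k` is an assumed knob.
    """
    d = q.size(-1)
    scores = (q @ k.transpose(-2, -1)) / d**0.5        # (B, H, Q, T)
    k_eff = min(top_k, scores.size(-1))                # guard short caches
    top_scores, top_idx = scores.topk(k_eff, dim=-1)   # (B, H, Q, k_eff)
    probs = F.softmax(top_scores, dim=-1)              # softmax over kept keys only

    # Gather the selected values for each query position: (B, H, Q, k_eff, d).
    v_exp = v.unsqueeze(2).expand(-1, -1, q.size(2), -1, -1)
    idx = top_idx.unsqueeze(-1).expand(-1, -1, -1, -1, d)
    v_sel = torch.gather(v_exp, 3, idx)

    return (probs.unsqueeze(-2) @ v_sel).squeeze(-2)   # (B, H, Q, d)
```

Since the abstract reports that most tokens' attention is sparse and concentrated, such a restriction usually changes the drafted token rarely, and the dense verification pass catches the cases where it does.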
Problem

Research questions and friction points this paper is trying to address.

Reducing inference latency in Video-LLMs for long sequences
Balancing sparse and dense attention to accelerate decoding
Maintaining performance while speeding up video processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse-to-Dense decoding strategy for Video-LLMs
Combines sparse top-K and dense full attention
Tuning-free, plug-and-play solution with up to a 1.94× speedup
👥 Authors
Xuan Zhang (Singapore Management University)
Cunxiao Du (Sea AI Lab)
Sicheng Yu (Singapore Management University)
Jiawei Wu (National University of Singapore)
Fengzhuo Zhang (National University of Singapore)
Wei Gao (Singapore Management University)
Qian Liu (Sea AI Lab)