Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs

📅 2025-05-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Video large language models (Video-LLMs) suffer from substantial inference latency when processing long videos due to their autoregressive decoding mechanism. To address this, we propose the Sparse-to-Dense (StD) decoding paradigm—a novel “fast-model speculation + slow-model parallel verification” co-design. Specifically, a lightweight fast model performs multi-step token speculation using top-K sparse attention, while a full-attention slow model concurrently verifies all candidate tokens in parallel. Critically, StD requires zero fine-tuning, is fully plug-and-play, and necessitates only minimal code modifications. Evaluated across mainstream Video-LLMs, StD achieves up to 1.94× end-to-end speedup without sacrificing original model accuracy. To our knowledge, it is the first training-free, general-purpose inference optimization framework for efficient long-video understanding—offering broad compatibility and immediate deployment utility.
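The draft-then-verify loop described above is easiest to see in code. Below is a minimal, illustrative sketch of one round of greedy speculative decoding, not the paper's implementation: `draft_model`, `target_model`, and the default `gamma` are hypothetical, the models are assumed to be HuggingFace-style callables returning `.logits`, and KV caches are omitted for readability.

```python
import torch

@torch.no_grad()
def speculate_and_verify(draft_model, target_model, input_ids, gamma=4):
    """One round of greedy draft-then-verify decoding (illustrative only).

    draft_model:  fast model, e.g. one restricted to top-K sparse attention
    target_model: slow full-attention model used for parallel verification
    gamma:        number of tokens drafted per round (hypothetical default)
    Assumes batch size 1; a real implementation would reuse KV caches
    instead of recomputing the full prefix each call.
    """
    # 1) Draft `gamma` tokens autoregressively with the cheap model.
    draft_ids = input_ids
    for _ in range(gamma):
        logits = draft_model(draft_ids).logits[:, -1]
        next_id = logits.argmax(dim=-1, keepdim=True)   # greedy drafting
        draft_ids = torch.cat([draft_ids, next_id], dim=-1)

    # 2) Verify all drafted tokens in ONE parallel forward pass of the
    #    full-attention model.
    target_logits = target_model(draft_ids).logits
    n_prompt = input_ids.size(1)
    # The target's prediction at position t-1 must match the draft token at t.
    preds = target_logits[:, n_prompt - 1 : -1].argmax(dim=-1)
    drafted = draft_ids[:, n_prompt:]
    agree = (preds == drafted).long().cumprod(dim=-1)   # longest agreeing prefix
    n_accept = int(agree.sum())

    # 3) Keep the accepted prefix plus the target model's own next token,
    #    so every round emits at least one verified token.
    accepted = draft_ids[:, : n_prompt + n_accept]
    bonus = target_logits[:, n_prompt + n_accept - 1].argmax(dim=-1, keepdim=True)
    return torch.cat([accepted, bonus], dim=-1)
```

Because verification is a single batched forward pass over all drafted tokens, the slow model's cost is amortized: whenever the draft agrees with the target on several tokens, those tokens are emitted for roughly the price of one dense decoding step, which is the source of the reported lossless speedup.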

📝 Abstract
Due to the auto-regressive nature of current video large language models (Video-LLMs), the inference latency increases as the input sequence length grows, posing challenges for the efficient processing of video sequences that are usually very long. We observe that during decoding, the attention scores of most tokens in Video-LLMs tend to be sparse and concentrated, with only certain tokens requiring comprehensive full attention. Based on this insight, we introduce Sparse-to-Dense (StD), a novel decoding strategy that integrates two distinct modules: one leveraging sparse top-K attention and the other employing dense full attention. These modules collaborate to accelerate Video-LLMs without loss. The fast (sparse) model speculatively decodes multiple tokens, while the slow (dense) model verifies them in parallel. StD is a tuning-free, plug-and-play solution that achieves up to a 1.94× walltime speedup in video processing. It maintains model performance while enabling a seamless transition from a standard Video-LLM to a sparse Video-LLM with minimal code modifications.
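For intuition, the sparse module can be thought of as standard attention restricted to the top-K highest-scoring keys for each query, which is what makes the draft model cheap on long video contexts. The sketch below is a generic top-K sparse attention in PyTorch, not the paper's exact kernel; shapes follow the usual (batch, heads, length, head_dim) convention and `top_k` is an illustrative parameter.

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, top_k=64):
    """Attention restricted to the top-K highest-scoring keys per query.

    q:    (B, H, Q, d) queries (Q = 1 during autoregressive decoding)
    k, v: (B, H, T, d) cached keys/values; `top_k` is an assumed knob.
    """
    d = q.size(-1)
    scores = (q @ k.transpose(-2, -1)) / d**0.5        # (B, H, Q, T)
    k_eff = min(top_k, scores.size(-1))                # guard short caches
    top_scores, top_idx = scores.topk(k_eff, dim=-1)   # (B, H, Q, k_eff)
    probs = F.softmax(top_scores, dim=-1)              # softmax over kept keys only

    # Gather the selected values for each query position: (B, H, Q, k_eff, d).
    v_exp = v.unsqueeze(2).expand(-1, -1, q.size(2), -1, -1)
    idx = top_idx.unsqueeze(-1).expand(-1, -1, -1, -1, d)
    v_sel = torch.gather(v_exp, 3, idx)

    return (probs.unsqueeze(-2) @ v_sel).squeeze(-2)   # (B, H, Q, d)
```

Since the abstract reports that most tokens' attention is sparse and concentrated, such a restriction usually changes the drafted token rarely, and the dense verification pass catches the cases where it does.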
Problem

Research questions and friction points this paper is trying to address.

Reducing inference latency in Video-LLMs for long sequences
Balancing sparse and dense attention to accelerate decoding
Maintaining performance while speeding up video processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse-to-Dense decoding strategy for Video-LLMs
Combines sparse top-K and dense full attention
Tuning-free, plug-and-play solution with up to a 1.94× speedup
👥 Authors
Xuan Zhang (Singapore Management University)
Cunxiao Du (Sea AI Lab)
Sicheng Yu (Singapore Management University)
Jiawei Wu (National University of Singapore)
Fengzhuo Zhang (National University of Singapore)
Wei Gao (Singapore Management University)
Qian Liu (Sea AI Lab)