See the Forest for the Trees: Loosely Speculative Decoding via Visual-Semantic Guidance for Efficient Inference of Video LLMs

📅 2026-04-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the high inference latency of video large language models (Video LLMs) in autoregressive generation, where existing speculative decoding methods are hindered by rigid exact-match constraints that limit acceleration efficacy. The authors propose LVSpec, the first training-free, relaxed speculative decoding framework, which introduces a novel vision-semantic guidance mechanism. This mechanism leverages lightweight identification of visually relevant anchor tokens and employs a position-shift-tolerant semantic equivalence verification strategy, thereby overcoming the limitations of strict token-level matching. Evaluated on Qwen2.5-VL-32B and LLaVA-OneVision-72B, LVSpec achieves speedups of 2.70× and 2.94× respectively, while preserving over 99.8% of original model performance. Moreover, it improves the average accepted length and speedup ratio by 136% and 35% compared to current state-of-the-art methods.
📝 Abstract
Video Large Language Models (Video-LLMs) excel in video understanding but suffer from high inference latency during autoregressive generation. Speculative Decoding (SD) mitigates this by applying a draft-and-verify paradigm, yet existing methods are constrained by rigid exact-match rules, severely limiting the acceleration potential. To bridge this gap, we propose LVSpec, the first training-free loose SD framework tailored for Video-LLMs. Grounded in the insight that generation is governed by sparse visually relevant anchors (mandating strictness) amidst abundant visually irrelevant fillers (permitting loose verification), LVSpec employs a lightweight visually relevant token identification scheme to accurately pinpoint the former. To further maximize acceptance, we augment this with a position-shift-tolerant mechanism that effectively salvages positionally mismatched but semantically equivalent tokens. Experiments demonstrate that LVSpec achieves high fidelity and speed: it preserves >99.8% of target performance while accelerating Qwen2.5-VL-32B by 2.70× and LLaVA-OneVision-72B by 2.94×. Notably, it boosts the mean accepted length and speedup ratio by 136% and 35% compared to SOTA training-free SD methods for Video-LLMs.
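The abstract's relaxed verification idea can be illustrated with a toy acceptance loop. This is a minimal sketch, not the paper's actual algorithm: the `is_anchor` and `equivalent` predicates are hypothetical stand-ins for LVSpec's visually relevant token identification and semantic-equivalence check, and `shift_window` stands in for its position-shift tolerance.

```python
# Illustrative sketch of loose speculative-decoding verification
# (assumed interface, not the authors' implementation).
# Visually relevant "anchor" tokens must match the target model's
# prediction exactly; "filler" tokens may be accepted if semantically
# equivalent, and a small positional shift is tolerated.

def loose_verify(draft, target, is_anchor, equivalent, shift_window=2):
    """Return the number of drafted tokens accepted.

    draft, target: lists of token ids (draft proposals vs. target outputs)
    is_anchor(tok): True if the token is a visually relevant anchor
    equivalent(a, b): True if two tokens are semantically interchangeable
    shift_window: how far ahead to look for a positionally shifted match
    """
    accepted = 0
    for i, tok in enumerate(draft):
        if i >= len(target):
            break
        if tok == target[i]:                      # exact match: always accept
            accepted += 1
        elif is_anchor(target[i]):                # anchors demand strictness
            break
        elif equivalent(tok, target[i]):          # loose semantic acceptance
            accepted += 1
        elif any(j < len(target) and equivalent(tok, target[j])
                 for j in range(i + 1, i + 1 + shift_window)):
            accepted += 1                         # position-shift tolerance
        else:
            break
    return accepted
```

Longer accepted runs mean fewer calls to the large target model per generated token, which is where the reported speedups come from.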
Problem

Research questions and friction points this paper is trying to address.

Video-LLMs
inference latency
Speculative Decoding
autoregressive generation
exact-match constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speculative Decoding
Video-LLMs
Visual-Semantic Guidance
Training-Free Acceleration
Loose Verification