🤖 AI Summary
This work addresses the limitations of existing speculative decoding methods for video large language models (video-LLMs): insufficient preservation of visual semantic information and high computational overhead from draft models, both of which hinder effective inference acceleration. To overcome these challenges, the authors propose HIPPO, a holistic perception-aware parallel speculative decoding framework that integrates a semantic-aware visual token retention strategy, a global-local attention fusion mechanism, and a decoupled pipeline that runs draft generation and verification in parallel. This design overlaps the two stages while retaining critical visual information during inference. Evaluated on four prominent video-LLMs across six benchmarks, the proposed method achieves up to 3.51× inference speedup, significantly outperforming both vanilla autoregressive decoding and state-of-the-art speculative decoding approaches.
📝 Abstract
Speculative decoding (SD) has emerged as a promising approach to accelerating LLM inference without sacrificing output quality. Existing SD methods tailored to video-LLMs primarily focus on pruning redundant visual tokens to mitigate the computational burden of massive visual inputs. However, they still fall short of the inference acceleration achieved on text-only LLMs. Extensive experiments reveal that this gap mainly stems from two limitations: (i) existing pruning strategies inadequately preserve visual semantic tokens, degrading draft quality and acceptance rates; and (ii) even with aggressive pruning (e.g., 90% of visual tokens removed), the draft model's remaining inference cost caps the overall speedup. To address these limitations, we propose HIPPO, a general holistic-aware parallel speculative decoding framework. Specifically, HIPPO introduces (i) a semantic-aware token preservation method that fuses global attention scores with local visual semantics to retain semantic information even at high pruning ratios, and (ii) a video parallel SD algorithm that decouples and overlaps the draft generation and target verification phases. Experiments on four video-LLMs across six benchmarks demonstrate HIPPO's effectiveness, yielding up to 3.51× speedup over vanilla auto-regressive decoding.
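The abstract does not give implementation details; as a rough illustration only, the token-preservation idea of fusing global attention scores with local visual-semantic scores might look like the sketch below. The function name, the min-max normalization, and the fusion weight `alpha` are assumptions for illustration, not HIPPO's actual method.

```python
import numpy as np

def retain_visual_tokens(global_attn, local_sem, keep_ratio=0.1, alpha=0.5):
    """Hypothetical sketch: fuse global attention scores with local
    visual-semantic scores and keep the top-scoring visual tokens.

    global_attn: per-token attention score from the full (global) context
    local_sem:   per-token local visual-semantic score (e.g., within-frame)
    keep_ratio:  fraction of visual tokens to retain after pruning
    alpha:       fusion weight between the two score sources (assumed)
    """
    # Min-max normalize both score vectors so they are comparable.
    g = (global_attn - global_attn.min()) / (np.ptp(global_attn) + 1e-8)
    s = (local_sem - local_sem.min()) / (np.ptp(local_sem) + 1e-8)
    fused = alpha * g + (1 - alpha) * s
    # Keep the top-k tokens by fused score, preserving original order.
    k = max(1, int(round(len(fused) * keep_ratio)))
    return np.sort(np.argsort(fused)[-k:])
```

At a 90% pruning ratio (`keep_ratio=0.1`), a token that scores low under global attention but carries strong local semantics can still survive, which is the stated motivation for fusing the two signals.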