MMSpec: Benchmarking Speculative Decoding for Vision-Language Models

📅 2026-03-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high inference latency of vision-language models (VLMs) and the lack of systematic evaluation of speculative decoding methods in multimodal settings. To this end, we introduce MMSpec, the first benchmark for speculative decoding in VLMs, comprising 600 multimodal samples and 10 representative algorithms. We further propose ViSkip, a method that dynamically adapts to visual tokens during decoding. Through a unified evaluation framework, our analysis reveals the limitations of text-centric speculative strategies in multimodal tasks, highlights the critical role of visual awareness in batched inference performance, and demonstrates that throughput gains do not necessarily translate to latency reduction. Experimental results show that ViSkip achieves state-of-the-art performance on MMSpec, validating the effectiveness of vision-aware speculative decoding for accelerating VLM inference.

📝 Abstract
Vision-language models (VLMs) achieve strong performance on multimodal tasks but suffer from high inference latency due to large model sizes and long multimodal contexts. Speculative decoding has recently emerged as an effective acceleration technique, yet its behavior in VLMs remains insufficiently understood. We introduce MMSpec, the first benchmark for evaluating speculative decoding in vision-language models. MMSpec contains 600 multimodal samples across six task categories and integrates ten representative speculative decoding algorithms under a unified evaluation framework. Our study reveals three key findings: (1) methods designed for text-only LLMs degrade in multimodal scenarios, (2) vision awareness becomes increasingly important at larger batch sizes, and (3) throughput speedup alone does not reliably reflect latency performance. Motivated by these findings, we propose ViSkip, a plug-and-play speculative decoding method that dynamically adapts speculation to vision tokens and achieves state-of-the-art performance.
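To make the draft-then-verify mechanic behind the abstract concrete, here is a minimal greedy speculative-decoding sketch. This is an illustration only, not the paper's method: the benchmark's ten algorithms and ViSkip are not reproduced, and `draft`/`target` are hypothetical stand-ins for a small drafter and the full VLM.

```python
# Minimal greedy speculative decoding, for illustration only.
# `draft` and `target` are toy stand-ins, not the paper's models.

def speculative_step(prefix, draft, target, k=4):
    """One speculation round: draft k tokens, keep the longest prefix
    the target agrees with, then append one token from the target."""
    # 1) Drafting: the cheap model proposes k tokens autoregressively.
    ctx = list(prefix)
    proposal = []
    for _ in range(k):
        tok = draft(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2) Verification: in practice this is ONE batched forward pass of
    #    the target model; here we call it per position for clarity.
    ctx = list(prefix)
    accepted = []
    for tok in proposal:
        if target(ctx) == tok:          # target agrees -> accept draft token
            accepted.append(tok)
            ctx.append(tok)
        else:                           # first mismatch -> emit target's token, stop
            accepted.append(target(ctx))
            return accepted
    accepted.append(target(ctx))        # all drafts accepted -> free bonus token
    return accepted

# Toy deterministic "models" over integer token ids:
target = lambda ctx: ctx[-1] + 1        # ground-truth next token
draft_good = lambda ctx: ctx[-1] + 1    # always agrees with the target
draft_bad = lambda ctx: ctx[-1] + 2     # always disagrees
```

With `draft_good`, one round emits k + 1 tokens for a single target "pass" (`speculative_step([0], draft_good, target)` yields five tokens); with `draft_bad`, every round emits just one, which is why acceptance rate, not draft speed alone, governs the speedup the benchmark measures.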
Problem

Research questions and friction points the paper addresses.

speculative decoding
vision-language models
inference latency
multimodal tasks
benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

speculative decoding
vision-language models
multimodal benchmark
ViSkip
inference acceleration